Abstract

Individual decision-makers consume information revealed by previous decision-makers, and produce
information that may help in future decisions. This phenomenon is common in a wide range of
scenarios in the Internet economy, as well as in other domains such as medical decisions. Each
decision-maker would individually prefer to exploit: select an action with the highest expected reward given
her current information. At the same time, each decision-maker would prefer previous decision-makers
to explore, producing information about the rewards of various actions. A social planner, by means
of carefully designed information disclosure, can incentivize the agents to balance exploration and
exploitation so as to maximize social welfare.

We formulate this problem as a multi-armed bandit problem (and various generalizations thereof) under
incentive-compatibility constraints induced by the agents’ Bayesian priors. We design an
incentive-compatible bandit algorithm for the social planner whose regret is asymptotically optimal among all
bandit algorithms (incentive-compatible or not). Further, we provide a black-box reduction from an
arbitrary multi-armed bandit algorithm to an incentive-compatible one, with only a constant multiplicative
increase in regret. This reduction works for very general bandit settings that incorporate contexts and
arbitrary auxiliary feedback.


*An extended abstract of this paper has been published in ACM EC 2015 (16th ACM Conf. on Economics and Computation). 
Compared to the version in conference proceedings, this version contains complete proofs, revamped introductory sections, and 
thoroughly revised presentation of the technical material. Further, two major extensions are fleshed out, resp. to more than two 
actions and to more general machine learning settings, whereas they were only informally described in the conference version. 
The main results are unchanged, but their formulation and presentation are streamlined, particularly regarding assumptions on the
common prior. This version also contains a discussion of potential applications to medical trials. 
1 Introduction


Decisions made by an individual often reveal information about the world that can be useful to others. For 
example, the decision to dine in a particular restaurant may reveal some observations about this restaurant. 
This revelation could be achieved, for example, by posting a photo, tweeting, or writing a review. Others can 
consume this information either directly (via photo, review, tweet, etc.) or indirectly through aggregations, 
summarizations or recommendations. Thus, individuals have a dual role: they both consume information 
from previous individuals and produce information for future consumption. This phenomenon applies very 
broadly: the choice of a product or experience, be it a movie, hotel, book, home appliance, or virtually any 
other consumer’s choice, leads to an individual’s subjective observations pertaining to this choice. These 
subjective observations can be recorded and collected, e.g., when the individual ranks a product or leaves a 
review, and can help others make similar choices in similar circumstances in a more informed way.
Collecting, aggregating and presenting such observations is a crucial value proposition of numerous businesses in
the modern Internet economy, such as TripAdvisor, Yelp, Netflix, Amazon, Waze and many others. Similar
issues, albeit possibly with much higher stakes, arise in medical decisions: selecting a doctor or a hospital,
choosing a drug or a treatment, or deciding whether to participate in a medical trial. First the individual can
consult information from similar individuals in the past, to the extent that such information is available, and
later she can contribute her experience as a review or as an outcome in a medical trial.

If a social planner were to direct the individuals in the information-revealing decisions discussed above,
she would have two conflicting goals: exploitation, choosing the best alternative given the information
available so far, and exploration, trying out less known alternatives for the sake of gathering more information,
at the risk of worsening the individual experience. A social planner would like to combine exploration and
exploitation so as to maximize the social welfare, which results in the exploration-exploitation tradeoff, a
well-known subject in Machine Learning, Operations Research and Economics.

However, when the decisions are made by individuals rather than enforced by the planner, we have 
another problem dimension based on the individuals’ incentives. While the social planner benefits from 
both exploration and exploitation, each individual’s incentives are typically skewed in favor of the latter. (In
particular, many people prefer to benefit from exploration done by others.) Therefore, the society as a whole
may suffer from an insufficient amount of exploration. In particular, if a given alternative appears suboptimal
given the information available so far, however sparse and incomplete, then this alternative may remain
unexplored, even though in reality it may be the best.

The focus of this work is how to incentivize self-interested decision-makers to explore. We consider a 
social planner who cannot control the decision-makers, but can communicate with them, e.g., recommend 
an action and observe the outcome later on. Such a planner would typically be implemented via a
website, either one dedicated to recommendations and feedback collection (such as Yelp or Waze), or one that
actually provides the product or experience being recommended (such as Netflix or Amazon). In
medical applications, the planner would be either a website that rates/recommends doctors and collects reviews
on their services (such as rateMDs.com or SuggestADoctor.com), or an organization conducting a medical
trial. We are primarily interested in exploration that is efficient from the social planner’s perspective, i.e.,
exploration that optimizes the social welfare.¹

Following Kremer et al. (2014), we consider a basic scenario when the only incentive offered by the
social planner is the recommended experience itself (or rather, the individual’s belief about the expected utility
of this experience). In particular, the planner does not offer payments for following the recommendation.


¹In the context of the Internet economy, the “planner” would be a for-profit company. Yet, the planner’s goal, for the purposes of
incentivizing exploration, would typically be closely aligned with the social welfare.
On a technical level, we study a mechanism design problem with an explore-exploit tradeoff and auxiliary
incentive-compatibility constraints. Absent these constraints, our problem reduces to multi-armed bandits
(MAB) with stochastic rewards, the paradigmatic and well-studied setting for exploration-exploitation
tradeoffs, and various generalizations thereof. The interaction between the planner and a single agent can be
viewed as a version of the Bayesian Persuasion game (Kamenica and Gentzkow, 2011) in which the planner
has more information due to the feedback from the previous agents; in fact, this information asymmetry is
crucial for ensuring the desired incentives.


1.1 Our model and scope 

We consider the following abstract framework, called incentive-compatible exploration. The social planner 
is an algorithm that interacts with the self-interested decision-makers (henceforth, agents) over time. In each 
round, an agent arrives, chooses one action among several alternatives, receives a reward for the chosen 
action, and leaves forever. Before an agent makes her choice, the planner sends a message to the agent 
which includes a recommended action. Everything that happens in a given round is observed by the planner, 
but not by other agents. The agent has a Bayesian prior on the reward distribution, and chooses an action 
that maximizes her Bayesian expected reward given the algorithm’s message (breaking ties in favor of the
recommended action). The agent’s prior is known to the planner, either fully or partially. We require the 
planner’s algorithm to be Bayesian incentive-compatible (henceforth, BIC), in the sense that each agent’s 
Bayesian expected reward is maximized by the recommended action. The basic goal is to design a BIC 
algorithm so as to maximize social welfare, i.e., the cumulative reward of all agents. 

The algorithm’s message to each agent is restricted to the recommended action (call such algorithms 
message-restricted). Any BIC algorithm can be turned into a message-restricted BIC algorithm which 
chooses the same actions, as long as the agents’ priors are exactly known to the planner.² Note that a
message-restricted algorithm (BIC or not) is simply an MAB-like learning algorithm for the same setting.

A paradigmatic example is the setting where the reward is an independent draw from a distribution 
determined only by the chosen action. All agents share a common Bayesian prior on the reward distribution; 
the prior is also known to the planner. No other information is received by the algorithm or an agent 
(apart from the prior, the recommended action, and the reward for the chosen action). We call this setting 
BIC bandit exploration. Absent the BIC constraint, it reduces to the MAB problem with IID rewards and 
Bayesian priors. We also generalize this setting in several directions, both in terms of the machine learning
problem being solved by the planner’s algorithm, and in terms of the mechanism design assumptions on the 
information structure. 
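
The round-by-round protocol of BIC bandit exploration can be sketched as a short simulation. This is an illustration rather than anything specified in the paper: the name `run_rounds`, the Bernoulli reward model, and the assumption that every agent follows the recommendation (which is what the BIC property is meant to justify) are choices made here for concreteness.

```python
import random
from typing import Callable, List, Tuple

History = List[Tuple[int, float]]  # (recommended action, realized reward) per round

def run_rounds(recommend: Callable[[History], int],
               means: List[float], T: int, seed: int = 0) -> History:
    """Simulate T rounds of a message-restricted planner facing obedient agents.

    `recommend` is the planner's algorithm: it sees the full history
    (agents do not) and returns the action to recommend.
    `means` holds the expected reward of each action (hypothetical instance).
    """
    rng = random.Random(seed)
    history: History = []
    for _ in range(T):
        arm = recommend(history)            # the only message the agent receives
        # Assumption: the algorithm is BIC, so the agent follows the recommendation.
        reward = 1.0 if rng.random() < means[arm] else 0.0  # IID Bernoulli reward
        history.append((arm, reward))       # observed by the planner only
    return history

# A toy planner that alternates between two actions.
history = run_rounds(lambda h: len(h) % 2, means=[1.0, 0.0], T=4)
```

The planner here is just a function of the history, which reflects the remark that a message-restricted algorithm is simply an MAB-like learning algorithm; the BIC requirement is an extra constraint on which such functions are admissible.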

We assume that each agent knows which round she is arriving in. A BIC algorithm for this version is also
BIC for a more general version in which each agent has a Bayesian prior on her round.

Discussion. BIC exploration does not rely on “external” incentives such as monetary payments or discounts, 
promise of a higher social status, or people’s affinity towards experimentation. This mitigates the potential 
for selection bias, when the population that participates in the experiment differs from the target population. 
Indeed, paying patients for participation in a medical trial may be more appealing to poorer patients; offering 
discounts for new services may attract customers who are more sensitive to such discounts; and relying on 
people who like to explore for themselves would lead to a dataset that represents this category of people 
rather than the general population. While all these approaches are reasonable and in fact widely used (with 
well-developed statistical tools to account for the selection bias), an alternative intrinsically less prone to 
selection bias is, in our opinion, worth investigating. 

²This is due to a suitable version of the “revelation principle”, as observed in Kremer et al. (2014).
The “intrinsic” incentives offered by BIC exploration can be viewed as a guarantee of fairness for the
agents: indeed, even though the planner imposes experimentation on the agents, the said experimentation 
does not degrade expected utility of any one agent. (This is because an agent can always choose to ignore 
the experimentation and select an action with the highest prior mean reward.) This is particularly important 
for settings in which “external” incentives described above do not fully substitute for the intrinsic utility of 
the chosen actions. For example, a monetary payment does not fully substitute for an adverse outcome of a 
medical trial, and a discounted meal at a restaurant does not fully substitute for a bad experience. 

We focus on message-restricted algorithms, and rely on the BIC property to convince agents to follow 
our recommendations. We do not attempt to make our recommendations more convincing by revealing 
additional information, because doing so does not help in our model, and because the desirable kinds of 
additional information to be revealed are likely to be application-specific (whereas with message-restricted 
algorithms we capture many potential applications at once). Further, message-restricted algorithms are 
allowed, and even recommended, in the domain of medical trials (see Section 1.4 for discussion).

Objectives. We seek BIC algorithms whose performance is near-optimal for the corresponding setting
without the BIC constraint. This is a common viewpoint for welfare-optimizing mechanism design problems,
which often leads to strong positive results, both prior-independent and Bayesian, even if Bayesian-optimal 
BIC algorithms are beyond one’s reach. Prior-independent guarantees are particularly desirable because 
priors are almost never completely correct in practice. 

We express prior-independent performance of an algorithm via a standard notion of regret: the
difference, in terms of the cumulative expected reward, between the algorithm and the best fixed action.
Intuitively, it is the extent to which the algorithm “regrets” not knowing the best action in advance. For 
Bayesian performance, we consider Bayesian regret: ex-post regret in expectation over the prior, and also 
the average Bayesian-expected reward per agent. (For clarity, we will refer to the prior-independent version 
as ex-post regret.) Moreover, we consider a version in which the algorithm outputs a prediction after each 
round (visible only to the planner), e.g., the predicted best action; then we are interested in the rate at which 
this prediction improves over time. 
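
As a small illustration of the regret notion, the sketch below computes the regret of a given action sequence on a fixed vector of expected rewards. The function names are introduced here and are not from the paper; the paper's ex-post regret additionally takes an expectation over the algorithm's randomness, and Bayesian regret averages the same quantity over expected rewards drawn from the prior.

```python
from typing import Sequence

def ex_post_regret(means: Sequence[float], actions: Sequence[int]) -> float:
    """Gap in expected reward to the best fixed action, summed over all rounds."""
    best = max(means)
    return sum(best - means[a] for a in actions)

def average_expected_reward(means: Sequence[float], actions: Sequence[int]) -> float:
    """Average expected reward per agent for the same action sequence."""
    return sum(means[a] for a in actions) / len(actions)

# Two actions with expected rewards 0.5 and 0.7; the sequence picks the worse one twice.
means = [0.5, 0.7]
actions = [0, 1, 1, 0]
regret = ex_post_regret(means, actions)            # two rounds, each losing about 0.2
avg = average_expected_reward(means, actions)      # about 0.6
```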

1.2 Our contributions 

On a high level, we make two contributions: 

Regret minimization. We provide an algorithm for BIC bandit exploration whose ex-post regret is
asymptotically optimal among all MAB algorithms (assuming a constant number of actions). Our algorithm
is detail-free, in that it requires very limited knowledge of the prior.

Black-box reduction. We provide a reduction from an arbitrary learning algorithm to a BIC one, with only
a minor loss in performance; this reduction “works” for a very general BIC exploration setting.

In what follows we discuss our results in more detail. 

Regret minimization. Following the literature on regret minimization, we focus on the asymptotic ex-post 
regret rate as a function of the time horizon (which in our setting corresponds to the number of agents). 

We establish that the BIC restriction does not affect the asymptotically optimal ex-post regret rate for a 
constant number of actions. The optimality is two-fold: in the worst case over all realizations of the common 
prior (i.e., for every possible vector of expected rewards), and for every particular realization (which may 
allow much smaller ex-post regret than in the worst-case). 
More formally, if $T$ is the time horizon and $\Delta$ is the “gap” in expected reward between the best action
and the second-best action, then we achieve ex-post regret

$$c_{\mathcal{P}} \log(T) + c_0 \cdot \min\left(\tfrac{1}{\Delta} \log T,\ \sqrt{T \log T}\right),$$

where $c_0$ is an absolute constant, and $c_{\mathcal{P}}$ is a constant that depends only on the common prior $\mathcal{P}$. A
well-known lower bound states that one cannot achieve ex-post regret better than $O(\min(\tfrac{1}{\Delta} \log T, \sqrt{T}))$.³
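
To get a feel for the two regimes inside the minimum, the following sketch evaluates the upper bound numerically. The constants are placeholders: the text only says that one is absolute and the other depends on the prior, so setting both to 1 is purely an assumption for illustration, and the function names are hypothetical.

```python
import math

def regret_upper_bound(T: int, gap: float, c0: float = 1.0, c_prior: float = 1.0) -> float:
    """c_prior * log T + c0 * min(log(T) / gap, sqrt(T log T)), with placeholder constants."""
    log_t = math.log(T)
    return c_prior * log_t + c0 * min(log_t / gap, math.sqrt(T * log_t))

def active_regime(T: int, gap: float) -> str:
    """Which term of the minimum is smaller for this horizon and gap."""
    log_t = math.log(T)
    return "gap-dependent" if log_t / gap <= math.sqrt(T * log_t) else "worst-case"

T = 10_000
print(active_regime(T, gap=0.5))    # large gap: the (1/gap) log T term is the smaller one
print(active_regime(T, gap=0.01))   # tiny gap: the sqrt(T log T) term takes over
```

The switch between the two terms happens around a gap of order $\sqrt{\log(T)/T}$, which is roughly 0.03 for $T = 10{,}000$.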

Conceptually, our algorithm implements adaptive exploration: the exploration schedule is adapted to the 
previous observations, so that the exploration of low-performing actions is phased out early. This is known 
to be a vastly superior approach compared to exploration schedules that are fixed in advance; in particular, 
the latter approach yields a much higher ex-post regret, both per-realization and in the worst case. 
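
The contrast between fixed and adaptive exploration schedules can be illustrated with two textbook bandit algorithms, neither of which is BIC or taken from this paper: explore-then-commit (fixed schedule) and successive elimination (adaptive schedule). The confidence radius and the exploration budget below are standard choices assumed here for the sketch.

```python
import math
import random

def pull(mean, rng):
    """Bernoulli reward with the given mean."""
    return 1.0 if rng.random() < mean else 0.0

def explore_then_commit(means, T, n_explore, rng):
    """Fixed schedule: n_explore pulls per arm, then commit to the empirical best."""
    best = max(means)
    sums = [0.0] * len(means)
    regret = 0.0
    for arm, mean in enumerate(means):
        for _ in range(n_explore):
            sums[arm] += pull(mean, rng)
            regret += best - mean
    leader = max(range(len(means)), key=lambda a: sums[a])
    remaining = T - n_explore * len(means)   # assumes the budget fits within T
    return regret + remaining * (best - means[leader])

def successive_elimination(means, T, rng):
    """Adaptive schedule: stop sampling an arm once it is confidently worse."""
    best = max(means)
    active = list(range(len(means)))
    sums = [0.0] * len(means)
    counts = [0] * len(means)
    regret, t = 0.0, 0
    while t < T:
        for arm in active:                   # one pass over the surviving arms
            if t == T:
                break
            sums[arm] += pull(means[arm], rng)
            counts[arm] += 1
            regret += best - means[arm]
            t += 1
        if t == T:
            break
        radius = {a: math.sqrt(2 * math.log(T) / counts[a]) for a in active}
        best_lower = max(sums[a] / counts[a] - radius[a] for a in active)
        # Keep only arms whose upper confidence bound reaches the best lower bound.
        active = [a for a in active if sums[a] / counts[a] + radius[a] >= best_lower]
    return regret

rng = random.Random(0)
means, T = [1.0, 0.0], 1000                  # an "easy" instance with a large gap
fixed = explore_then_commit(means, T, n_explore=round(T ** (2 / 3)), rng=rng)
adaptive = successive_elimination(means, T, rng)
```

On this easy instance the adaptive schedule drops the bad arm after a number of pulls that is logarithmic in `T`, whereas the fixed schedule spends its whole pre-committed budget on it.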

Further, our algorithm is detail-free, requiring very little knowledge of the common prior. This is
desirable because in practice it may be complicated or impossible to elicit the prior exactly. Moreover, this
feature allows the agents to have different priors (as long as they are “compatible” with the planner’s prior,
in a precise sense specified later). In fact, an agent does not even need to know her prior exactly: instead,
she would trust the planner’s recommendation as long as she believes that their priors are compatible.

Black-box reduction. Given an arbitrary MAB algorithm $\mathcal{A}$, we provide a BIC algorithm $\mathcal{A}^{\mathrm{IC}}$ which
internally uses $\mathcal{A}$ as a “black-box”. That is, $\mathcal{A}^{\mathrm{IC}}$ simulates a run of $\mathcal{A}$, providing inputs and recording the
respective outputs, but does not depend on the internal workings of $\mathcal{A}$. In addition to recommending an
action, the original algorithm $\mathcal{A}$ can also output a prediction after each round (visible only to the planner),
e.g., the predicted best action; then $\mathcal{A}^{\mathrm{IC}}$ outputs a prediction, too. A reduction such as ours allows a modular
design: one can design a non-BIC algorithm (or take an existing one), and then use the reduction to inject
incentive-compatibility. Modular designs are very desirable in complex economic systems, especially for
settings such as MAB with a rich body of existing work.

Our reduction incurs only a small loss in performance, which can be quantified in several ways. In terms
of Bayesian regret, the performance of $\mathcal{A}^{\mathrm{IC}}$ worsens by at most a constant multiplicative factor that only
depends on the prior. In terms of the average rewards, we guarantee the following: for any duration $T$, the
average Bayesian-expected reward of $\mathcal{A}^{\mathrm{IC}}$ between rounds $c_{\mathcal{P}}$ and $c_{\mathcal{P}} + T$ is at least that of the first $T/L_{\mathcal{P}}$
rounds in the original algorithm $\mathcal{A}$; here $c_{\mathcal{P}}$ and $L_{\mathcal{P}}$ are prior-dependent constants. Finally, if $\mathcal{A}$ outputs a
prediction $\phi_t$ after each round $t$, then $\mathcal{A}^{\mathrm{IC}}$ learns as fast as $\mathcal{A}$, up to a prior-dependent constant factor $c_{\mathcal{P}}$: for
every realization of the prior, its prediction in round $t$ has the same distribution as $\phi_{\lfloor t/c_{\mathcal{P}} \rfloor}$.⁴

The black-box reduction has several benefits other than “modular design”. Most immediately, one can 
plug in an MAB algorithm that takes into account the Bayesian prior or any other auxiliary information 
that a planner might have. Moreover, one may wish to implement a particular approach to exploration, e.g., 
incorporate some constraints on the losses, or preferences about which arms to favor or to throttle. Further,
the planner may wish to predict things other than the best action. To take a very stark example, the planner 
may wish to learn what are the worst actions (in order to eliminate these actions later by other means such 
as legislation). While the agents would not normally dwell on low-performing actions, our reduction would 
then incentivize them to explore these actions in detail. 

Beyond BIC bandit exploration. Our black-box reduction supports much richer scenarios than BIC bandit 
exploration. Most importantly, it allows for agent heterogeneity, as expressed by observable signals. We 

³More precisely: any MAB algorithm has ex-post regret at least $\Omega(\min(\tfrac{1}{\Delta} \log T, \sqrt{T}))$ in the worst case over all MAB instances
with two actions, time horizon $T$ and gap $\Delta$ (Lai and Robbins, 1985; Auer et al., 2002b).

⁴So if the original algorithm $\mathcal{A}$ gives an asymptotically optimal error rate as a function of $t$, compared to the “correct” prediction,
then so does the transformed algorithm $\mathcal{A}^{\mathrm{IC}}$, up to a prior-dependent multiplicative factor.
adopt the framework of contextual bandits, well-established in the Machine Learning literature (see
Section 2 for citations). In particular, each agent is characterized by a signal, called context, observable by both
the agent and the planner before the planner issues the recommendation. The context can include
demographics, tastes, preferences and other agent-specific information. It impacts the expected rewards received
by this agent, as well as the agent’s beliefs about these rewards. Rather than choose the best action, the 
planner now wishes to optimize a policy that maps contexts to actions. This type of agent heterogeneity 
is practically important: for example, websites that issue recommendations may possess a huge amount of 
information about their customers, and routinely use this “context” to adjust their recommendations (e.g., 
Amazon and Netflix). Our reduction turns an arbitrary contextual bandit algorithm into a BIC one, with 
performance guarantees similar to those for the non-contextual version. 

Moreover, the reduction allows learning algorithms to incorporate arbitrary auxiliary feedback that
agents’ actions may reveal. For example, a restaurant review may contain not only the overall
evaluation of the agent’s experience (i.e., her reward), but also reveal her culinary preferences, which in turn may
shed light on the popularity of other restaurants (i.e., on the expected rewards of other actions). Further,
an action can consist of multiple “sub-actions”, perhaps under common constraints, and the auxiliary
feedback may reveal the reward for each sub-action. For instance, a detailed restaurant recommendation may
include suggestions for each course, and a review may contain evaluations thereof. Such problems (without
incentive constraints) have been actively studied in machine learning, under the names “MAB with partial
monitoring” and “combinatorial semi-bandits”; see Section 2 for relevant citations.

In particular, we allow for scenarios when the planner’s utility is misaligned with the agents’ (and
observed by the algorithm as auxiliary feedback). For example, a vendor who recommends products to
customers may wish to learn which product is best for him, but may have a different utility function which
favors more expensive products or products that are tied in with his other offerings. Another example is a
medical trial of several available immunizations for the same contagious disease, potentially offering
different tradeoffs between the strength and duration of immunity and the severity of side effects. Hoping to
free-ride on the immunity of others, a patient may assign a lower utility to a successful outcome than the
government, and therefore prefer safer but less efficient options.

A black-box reduction such as ours is particularly desirable for the extended setting described above,
essentially because it is not tied to a particular variant of the problem. Indeed, contextual bandit algorithms
in the literature heavily depend on the class of policies to optimize over, whereas our reduction does not. 
Likewise, algorithms for bandits with auxiliary feedback heavily depend on the particular kind of feedback. 

1.3 A technical discussion 

Our techniques. An essential challenge in BIC exploration is to incentivize agents to explore actions that 
appear suboptimal according to the agent’s prior and/or the information currently available to the planner. 
The desirable incentives are created due to information asymmetry: the planner knows more than the agents
do, and the recommendation reveals a carefully calibrated amount of additional information. The agent’s 
beliefs are then updated so that the recommended action now seems preferable to others, even though the 
algorithm may in fact be exploring in this round, and/or the prior mean reward of this action may be small. 

Our algorithms are based on (versions of) a common building block: an algorithm that incentivizes 
agents to explore at least once during a relatively short time interval (a “phase”). The idea is to hide one 
round of exploration among many rounds of exploitation. An agent receiving a recommendation does not 
know whether this recommendation corresponds to exploration or to exploitation. However, the agents’ 
Bayesian posterior favors the recommended action because the exploitation is much more likely. Information 
asymmetry arises because the agent cannot observe the previous rounds and the algorithm’s randomness.
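
The idea of hiding one exploration round among many exploitation rounds can be checked numerically in a simplified two-arm setting. The sketch below is an illustration under assumptions made here, not the paper's algorithm: independent Beta priors with integer parameters and Bernoulli rewards, arm 1 with the higher prior mean and already sampled `k` times, and a phase of `L` agents in which one uniformly random agent is told to try arm 2 while the others are recommended the arm with the higher posterior mean. All function names are hypothetical. Writing $X$ for the posterior mean of $\mu_2 - \mu_1$ given the `k` samples, a recommendation of arm 2 is incentive-compatible when $\tfrac{1}{L}\,E[\mu_2 - \mu_1] + (1 - \tfrac{1}{L})\,E[\max(X, 0)] \ge 0$.

```python
from fractions import Fraction
from math import ceil, comb, factorial

def beta_fn(x: int, y: int) -> Fraction:
    """Beta function B(x, y) for positive integers, computed exactly."""
    return Fraction(factorial(x - 1) * factorial(y - 1), factorial(x + y - 1))

def expected_positive_gain(a1, b1, a2, b2, k) -> Fraction:
    """E[max(X, 0)], where X = E[mu2 - mu1 | k Bernoulli samples of arm 1]."""
    m2 = Fraction(a2, a2 + b2)  # arm 2 is unsampled, so its posterior mean is its prior mean
    total = Fraction(0)
    for s in range(k + 1):      # s = number of successes among the k samples
        p_s = comb(k, s) * beta_fn(a1 + s, b1 + k - s) / beta_fn(a1, b1)  # Beta-Binomial pmf
        post1 = Fraction(a1 + s, a1 + b1 + k)                             # posterior mean of arm 1
        total += p_s * max(m2 - post1, Fraction(0))
    return total

def bic_margin(L, a1, b1, a2, b2, k) -> Fraction:
    """E[(mu2 - mu1) * 1{arm 2 is recommended}] for an agent in a phase of L agents."""
    m1, m2 = Fraction(a1, a1 + b1), Fraction(a2, a2 + b2)
    p_explore = Fraction(1, L)  # chance that this agent is the hidden explorer
    return p_explore * (m2 - m1) + (1 - p_explore) * expected_positive_gain(a1, b1, a2, b2, k)

def min_phase_length(a1, b1, a2, b2, k) -> int:
    """Smallest L for which recommending arm 2 is incentive-compatible in this sketch."""
    m1, m2 = Fraction(a1, a1 + b1), Fraction(a2, a2 + b2)
    gain = expected_positive_gain(a1, b1, a2, b2, k)
    if gain == 0:
        raise ValueError("arm 2 never looks better after k samples; more samples are needed")
    return max(1, ceil(1 + (m1 - m2) / gain))

# Arm 1 ~ Beta(2, 1) (prior mean 2/3), arm 2 ~ Beta(1, 1) (prior mean 1/2), k = 4 samples.
L_min = min_phase_length(2, 1, 1, 1, 4)
```

A recommendation of arm 1 is always safe in this scheme, since it is only issued when arm 1 has the higher posterior mean. The `ValueError` branch reflects the remark that enough samples must be accumulated beforehand: if no outcome of the `k` samples makes arm 2 look better, no phase length suffices.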

To obtain BIC algorithms with good performance, we overcome a number of technical challenges, some
of which are listed below. First, an algorithm needs to convince an agent not to switch to several other
actions: essentially, all actions with larger prior mean reward than the recommended action. In particular,
the algorithm should accumulate sufficiently many samples of these actions beforehand. Second, we ensure
that the phase length (i.e., the sufficient size of the exploitation pool) does not need to grow over time. In
particular, it helps not to reveal any information to future agents (e.g., after each phase or at other “checkpoints”
throughout the algorithm). Third, for the black-box reduction we ensure that the choice of the bandit
algorithm to reduce from does not reveal any information about rewards. In particular, this consideration was
essential in formulating the main assumption in the analysis of the reduction. Fourth, the
detail-free algorithm cannot use Bayesian inference, and relies on sample average rewards to make conclusions
about Bayesian posterior rewards, even though the former is only an approximation for the latter.

The common prior. Our problem is hopeless for some priors. For a simple example, consider a prior on
two actions (whose expected rewards are denoted $\mu_1$ and $\mu_2$) such that $E[\mu_1] > E[\mu_2]$ and $\mu_1$ is statistically
independent from $\mu_1 - \mu_2$. Then, since no amount of samples from action 1 has any bearing on $\mu_1 - \mu_2$,
a BIC algorithm cannot possibly incentivize agents to try action 2. To rule out such pathological examples,
we make some assumptions. Our detail-free result assumes that the prior is independent across actions, and 
additionally posits minor restrictions in terms of bounded rewards and full support. The black-box reduction 
posits an abstract condition which allows for correlated priors, and includes independent priors as a special 
case (with similar minor restrictions). 
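
The pathological example can be spelled out in one line of posterior reasoning. The notation $S_1$ (any collection of reward samples from action 1) is introduced here for illustration, and the sketch assumes that $S_1$ depends on the prior only through $\mu_1$ and independent noise.

```latex
% S_1 is a function of \mu_1 and independent noise, and \mu_1 is independent of \mu_1 - \mu_2, so
\mathbb{E}[\mu_2 - \mu_1 \mid S_1] \;=\; \mathbb{E}[\mu_2 - \mu_1] \;<\; 0 .
% Any recommendation computed from S_1 and the algorithm's own randomness inherits this:
\mathbb{E}[\mu_2 - \mu_1 \mid \text{action 2 is recommended}] \;=\; \mathbb{E}[\mu_2] - \mathbb{E}[\mu_1] \;<\; 0 .
```

Hence, as long as only action 1 has been sampled, an agent who is recommended action 2 strictly prefers action 1. One concrete instance is $\mu_2 = \mu_1 - D$, where $D$ is independent of $\mu_1$ and $\mathbb{E}[D] > 0$.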

Map of the technical content. We discuss technical preliminaries in Section 3. The first technical result
in the paper is a BIC algorithm for initial exploration in the special case of two arms (Section 4), the most
lucid incarnation of the “common building block” discussed above. Then we present the main results for BIC
bandit exploration: the black-box reduction (Section 5) and the detail-free algorithm with optimal ex-post
regret (Section 6). Then we proceed with major extensions to contexts and auxiliary feedback (Section 7).
The proofs pertaining to the properties of the common prior are deferred to Section 8. Conclusions and open
questions are in Section 9. The detail-free algorithm becomes substantially simpler for the special case of
two actions. For better intuition, we provide a standalone exposition of this special case in Appendix A.


1.4 Further discussion on medical trials 


We view patients’ incentives as one of the major obstacles that inhibit medical trials in practice, or prevent 
some of them from happening altogether. This obstacle may be particularly damaging for large-scale trials 
that concern wide-spread medical conditions with relatively inexpensive treatments. Then finding suitable
patients and providing them with appropriate treatments would be fairly realistic, but incentivizing patients
to participate in sufficient numbers may be challenging. BIC exploration is thus a theoretical (and so far,
highly idealized) attempt to mitigate this obstacle.

Medical trials have been one of the original motivations for studying MAB and the exploration-exploitation
tradeoff (Thompson, 1933; Gittins, 1979). Bandit-like designs for medical trials belong to the realm of
adaptive medical trials (see Chow and Chang, 2008 for background), which also include other “adaptive”
features such as early stopping, sample size re-estimation, and changing the dosage.

“Message-restricted” algorithms (which recommend particular treatments to patients and reveal no other
information) are appropriate for this domain. Revealing some (but not all) information about the medical
trial is required to meet the standards of “informed consent”, as prescribed by various guidelines and
regulations (see Arango et al., 2012 for background). However, revealing information about clinical outcomes in
an ongoing trial is currently not required, to the best of our understanding. In fact, revealing such
information is seen as a significant threat to the statistical validity of the trial (because both patients and doctors may
become biased in favor of better-performing treatments), and care is advised to prevent information leaks as
much as possible (see Detry et al., 2012 for background and discussion, particularly pp. 26-30).

Medical trials provide additional motivation for BIC bandit exploration with multiple actions. While
traditional medical trials compare a new treatment against the placebo or a standard treatment, designs of
medical trials with multiple treatments have been studied in the biostatistics literature (e.g., see Hellmich,
2001; Freidlin et al., 2008), and are becoming increasingly important in practice (Parmar et al., 2014; Redig
and Janne, 2015). Note that even for the placebo or the standard treatment the expected reward is often not
known in advance, as it may depend on the particular patient population.

BIC contextual bandit exploration is particularly relevant to medical trials, as patients come with a lot of 
“context” which can be used to adjust and personalize the recommended treatment. The context can include 
age, fitness levels, race or ethnicity, various aspects of the patient’s medical history, as well as genetic 
markers (increasingly so as genetic sequencing is becoming more available). Context-dependent treatments 



the deployed designs are explicitly “contextual”, in that they seek the best policy, i.e., the best mapping from a patient’s
context to treatment (Redig and Janne, 2015). More advanced “contextual” designs have been studied in
biostatistics (e.g., see Freidlin and Simon, 2005; Freidlin et al., 2007, 2010).


2 Related work 


Technical comparison to Kremer, Mansour, and Perry (2014). The setting of BIC bandit exploration was
introduced in Kremer et al. (2014). While the expected reward is determined by the chosen action, we allow
the realized reward to be stochastic. Kremer et al. (2014) mainly focus on deriving the Bayesian-optimal
policy for the case of only two actions and deterministic rewards, and only obtain a preliminary result for
stochastic rewards. We improve over the latter result in several ways, detailed below.

Kremer et al. (2014) only consider the case of two actions, whereas we handle an arbitrary constant
number of actions. Handling more than two actions is important in practice, because recommendation
systems for products or experiences are rarely faced with only binary decisions. Further, medical trials
with multiple treatments are important, too (Hellmich, 2001; Freidlin et al., 2008; Parmar et al., 2014). For
multiple actions, convergence to the best action is a new and interesting result on its own, even regardless of
the rate of convergence. Our extension to more than two actions is technically challenging, requiring several
new ideas compared to the two-action case; especially so for the detail-free version.


We implement adaptive exploration, as discussed above, whereas the algorithm in Kremer et al. (2014)
is a BIC implementation of a “naive” MAB algorithm in which the exploration schedule is fixed in advance.
This leads to a stark difference in ex-post regret. To describe these improvements, let us define the MAB
instance as a mapping from actions to their expected rewards, and let us be more explicit about the asymptotic
ex-post regret rates as a function of the time horizon $T$ (i.e., the number of agents). Kremer et al. (2014) only
provide a regret bound of $\tilde{O}(T^{2/3})$ for all MAB instances,⁵ whereas our algorithm achieves ex-post regret
$\tilde{O}(\sqrt{T})$ for all MAB instances, and $\mathrm{polylog}(T)$ for MAB instances with constant “gap” in the expected
reward between the best and the second-best action. The literature on MAB considers this a significant
improvement (more on this in the related work section). In particular, the $\mathrm{polylog}(T)$ result is important
in two ways: it quantifies the advantage of “nice” MAB instances over the worst case, and of IID rewards
over adversarially chosen rewards.⁶ The sub-par regret rate in Kremer et al. (2014) is indeed a consequence
of fixed exploration: fixed-exploration algorithms cannot have worst-case regret better than $\Omega(T^{2/3})$, and
cannot achieve $\mathrm{polylog}(T)$ regret for MAB instances with constant gap (Babaioff et al., 2014).


In terms of information structure, the algorithm in Kremer et al. (2014) requires all agents to have the
same prior, and requires very detailed knowledge of that prior; both are significant impediments in practice.
In contrast, our detail-free result allows the planner to have only very limited knowledge of the prior, and
allows the agents to have different priors.

Finally, Kremer et al. (2014) do not provide an analog of our black-box reduction, and do not handle
the various generalizations of BIC bandit exploration that the reduction supports.

Exploration by self-interested agents. The study of mechanisms to incentivize exploration has been initiated
by Kremer et al. (2014); see the comparison above. Motivated by the same applications in the Internet
economy, Che and Hörner (2013) propose a model with a continuous information flow and a continuum of
consumers arriving to a recommendation system and derive a Bayesian-optimal incentive-compatible policy.
Their model is technically different from ours, and is restricted to two arms and binary rewards. Frazier
et al. (2014) consider a similar setting with monetary transfers, where the planner not only recommends an
action to each agent, but also offers a payment for taking this action. In their setting, incentives are created
via the offered payments rather than via information asymmetry.

Bolton and Harris (1999) and Keller et al. (2005) consider settings in which agents can engage in
exploration and can benefit from exploration performed by other agents. Unlike our setting, the agents are
long-lived (present for many rounds), and there is no planner to incentivize efficient exploration.

Rayo and Segal (2010), Kamenica and Gentzkow (2011) and Manso (2011) examine related settings
in which a planner incentivizes agents to make better choices, either via information disclosure or via a
contract. All three papers focus on settings with only two rounds. Ely et al. (2015) and Hörner and Skrzypacz
(2015) consider information disclosure over time, in very different models: resp., releasing news over time
to optimize suspense and surprise, and selling information over time.

Multi-armed bandits. Multi-armed bandits (MAB) have been studied extensively in Economics, Operations
Research and Computer Science since Thompson (1933). Motivations and applications range from
medical trials to pricing and inventory optimization to driving directions to online advertising to human
computation. A reader may refer to (Bubeck and Cesa-Bianchi, 2012) and (Gittins et al., 2011) for
background on regret-minimizing and Bayesian formulations, respectively. Further background on related
machine learning problems can be found in Cesa-Bianchi and Lugosi (2006). Our results are primarily related
to regret-minimizing MAB formulations with IID rewards (Lai and Robbins, 1985; Auer et al., 2002a).

Our detail-free algorithm builds on Hoeffding races (Maron and Moore, 1993, 1997), a well-known
technique in reinforcement learning. Its incarnation in the context of MAB is also known as active arms
elimination (Even-Dar et al., 2006).

The general setting for the black-box reduction is closely related to three prominent directions in the 
work on MAB: contextual bandits, MAB with partial monitoring, and MAB with budgeted exploration. 


$^5$ Here and elsewhere, the $\tilde{O}(\cdot)$ notation hides polylog($T$) factors.

$^6$ The MAB problem with adversarially chosen rewards only admits $\tilde{O}(\sqrt{T})$ ex-post regret in the worst case.

Contextual bandits have been introduced, under various names and models, in (Woodroofe, 1979; Auer
et al., 2002b; Wang et al., 2005; Langford and Zhang, 2007), and actively studied since then. We follow the
formulation from (Langford and Zhang, 2007) and a long line of subsequent work (e.g., Dudík et al., 2011;
Agarwal et al., 2014). In MAB with partial monitoring, auxiliary feedback is revealed in each round along
with the reward. A historical account for this direction can be found in Cesa-Bianchi and Lugosi (2006);
see Audibert and Bubeck (2010); Bartók et al. (2014) for examples of more recent progress. An important
special case of MAB with partial monitoring is “combinatorial semi-bandits”, where in each round the
algorithm chooses a subset from some fixed ground set, and the reward for each chosen element is revealed
(György et al., 2007; Kale et al., 2010; Audibert et al., 2011; Wen et al., 2015). In MAB with budgeted
exploration, the algorithm’s goal is to predict the best arm in a given number of rounds, and performance is
measured via prediction quality rather than cumulative reward (Even-Dar et al., 2002; Mannor and Tsitsiklis,
2004; Guha and Munagala, 2007; Goel et al., 2009; Bubeck et al., 2011).


Improving the asymptotic regret rate from $\tilde{O}(T^{\gamma})$, $\gamma > \frac{1}{2}$, to $\tilde{O}(\sqrt{T})$ and from $\tilde{O}(\sqrt{T})$ to polylog($T$)
has been a dominant theme in the literature on regret-minimizing MAB (a thorough survey is beyond our
scope, see Bubeck and Cesa-Bianchi (2012) for background and citations). In particular, the improvement
from $\tilde{O}(T^{2/3})$ to $\tilde{O}(\sqrt{T})$ regret due to the distinction between fixed and adaptive exploration has been a
major theme in (Babaioff et al., 2014; Devanur and Kakade, 2009; Babaioff et al., 2012).


pricing with model uncertainty (e.g., Kleinberg and Leighton, 2003; Besbes and Zeevi, 2009; Badanidiyuru
et al., 2013), dynamic auctions (e.g., Athey and Segal, 2013; Bergemann and Välimäki, 2010; Kakade et al.,
2013), pay-per-click ad auctions with unknown click probabilities (e.g., Babaioff et al., 2014; Devanur and
Kakade, 2009; Babaioff et al., 2015), as well as human computation (e.g., Ho et al., 2014; Ghosh and
Hummel, 2013; Singla and Krause, 2013). In particular, a black-box reduction from an arbitrary MAB
algorithm to an incentive-compatible algorithm is the main result in Babaioff et al. (2015), in the setting
of pay-per-click ad auctions with unknown click probabilities. The technical details in all this work (who
are the agents, what are the actions, etc.) are very different from ours, so a direct comparison of results is
uninformative.

As much of our motivation comes from human computation, we note that MAB-like problems have been
studied in several other settings motivated by human computation. Most of this work has focused on
crowdsourcing markets, see Slivkins and Vaughan (2013) for a discussion; specific topics include matching of
tasks and workers (e.g., Ho et al., 2013; Abraham et al., 2013) and pricing decisions (Ho et al., 2014; Singla
and Krause, 2013; Badanidiyuru et al., 2013). Also, Ghosh and Hummel (2013) considered incentivizing
high-quality user-generated content.


3 Model and preliminaries 

We define the basic model, called BIC bandit exploration; the generalizations are discussed in Section 7.
A sequence of $T$ agents arrive sequentially to the planner. In each round $t$, the interaction protocol is as
follows: a new agent arrives, the planner sends this agent a signal $\sigma_t$, the agent chooses an action $i_t$ in a set
$A$ of actions, receives a reward for this action, and leaves forever. The signal $\sigma_t$ includes a recommended
action $I_t \in A$. This entire interaction is not observed by other agents. The planner knows the value of $T$.
However, a coarse upper bound would suffice in most cases, with only a constant degradation of our results.

The planner chooses signals $\sigma_t$ using an algorithm, called the recommendation algorithm. If $\sigma_t = I_t = i_t$
(i.e., the signal is restricted to the recommended action, which is followed by the corresponding agent)
then the setting reduces to multi-armed bandits (MAB), and the recommendation algorithm is a bandit
algorithm. To follow the MAB terminology, we will use arms synonymously with actions; we sometimes write
“play/pull an arm” rather than “choose an action”.

Rewards. For each arm $i$ there is a parametric family $\mathcal{D}_i(\cdot)$ of reward distributions, parameterized by
the expected reward $\mu_i$. The reward vector $\mu = (\mu_1, \ldots, \mu_m)$ is drawn from some prior distribution $\mathcal{P}^0$.
Conditional on the mean $\mu_i$, the realized reward when a given agent chooses action $i$ is drawn independently
from distribution $\mathcal{D}_i(\mu_i)$. In this paper we restrict attention to the case of single-parameter families of
distributions; however, we do not believe this restriction is essential for our results to hold.

The prior $\mathcal{P}^0$ and the tuple $\mathcal{D} = (\mathcal{D}_i(\cdot) : i \in A)$ constitute the (full) Bayesian prior on rewards, denoted
$\mathcal{P}$. It is known to all agents and to the planner.$^7$ The expected rewards $\mu_i$ are not known to either.

For each arm $i$, let $\mathcal{P}_i^0$ be the marginal of $\mathcal{P}^0$ on this arm, and let $\mu_i^0 = \mathbb{E}[\mu_i]$ be the prior mean reward.
W.l.o.g., re-order the arms so that $\mu_1^0 \ge \mu_2^0 \ge \ldots \ge \mu_m^0$. The prior $\mathcal{P}$ is independent across arms if the
distribution $\mathcal{P}^0$ is a product distribution, i.e., $\mathcal{P}^0 = \mathcal{P}_1^0 \times \ldots \times \mathcal{P}_m^0$.

Incentive-compatibility. Each agent $t$ maximizes her own Bayesian expected reward, conditional on any
information that is available to her. Recall that the agent observes the planner’s message $\sigma_t$, and does not
observe anything about the previous rounds. Therefore, the agent simply chooses an action $i$ that maximizes
the posterior mean reward $\mathbb{E}[\mu_i \mid \sigma_t]$. In particular, if the signal does not contain any new information about
the reward vector $\mu$, then the agent simply chooses an action $i$ that maximizes the prior mean reward $\mathbb{E}[\mu_i]$.

Definition 3.1. A recommendation algorithm is Bayesian incentive-compatible (BIC) if

\[ \mathbb{E}[\mu_i \mid \sigma_t, I_t = i] \ \ge\ \max_{j \in A} \mathbb{E}[\mu_j \mid \sigma_t, I_t = i] \qquad \forall t \in [T],\ \forall i \in A. \tag{1} \]

Equivalently, the algorithm is BIC if the recommended action $i = I_t$ maximizes the posterior mean
reward $\mathbb{E}[\mu_i \mid \sigma_t]$. We will say that the algorithm is strongly BIC if Inequality (1) is always strict.

Throughout this paper, we focus on BIC recommendation algorithms with $\sigma_t = I_t$. As observed in
Kremer et al. (2014), this restriction is w.l.o.g. in the following sense. First, any recommendation algorithm
can be made BIC by re-defining $I_t$ to lie in $\arg\max_i \mathbb{E}[\mu_i \mid \sigma_t]$. Second, any BIC recommendation algorithm
can be restricted to $\sigma_t = I_t$, preserving the BIC property. Note that the first step may require full knowledge
of the prior and may be computationally expensive.

Thus, for each agent the recommended action is at least as good as any other action. For simplicity,
we assume the agents break ties in favor of the recommended action. Then the agents always follow the
recommended action.


Regret. The goal of the recommendation algorithm is to maximize the expected social welfare, i.e., the
expected total reward of all agents. For BIC algorithms, this is just the total expected reward of the algorithm
in the corresponding instance of MAB.

We measure the algorithm’s performance via the standard definitions of regret.


Definition 3.2 (Ex-post Regret). The ex-post regret of the algorithm is:

\[ R_\mu(T) = T \left(\max_i \mu_i\right) - \mathbb{E}\left[ \sum_{t=1}^{T} \mu_{I_t} \,\middle|\, \mu \right]. \tag{2} \]

The Bayesian regret of the algorithm is:

\[ R_{\mathcal{P}}(T) = \mathbb{E}_{\mu \sim \mathcal{P}^0}\left[ T \left(\max_i \mu_i\right) - \sum_{t=1}^{T} \mu_{I_t} \right] = \mathbb{E}_{\mu \sim \mathcal{P}^0}\left[ R_\mu(T) \right]. \tag{3} \]

$^7$ As mentioned in the Introduction, we also consider an extension to a partially known prior.

The ex-post regret is specific to a particular reward vector $\mu$; in the MAB terminology, $\mu$ is sometimes
called the MAB instance. In (2) the expectation is taken over the randomness in the realized rewards and the
algorithm, whereas in (3) it is also taken over the prior. The last summand in (3) is the expected reward of
the algorithm.

Ex-post regret allows us to capture “nice” MAB instances. Define the gap of a problem instance $\mu$ as
$\Delta = \mu^* - \max_{i:\, \mu_i < \mu^*} \mu_i$, where $\mu^* = \max_i \mu_i$. In words: the difference between the largest and the second
largest expected reward. One way to characterize “nice” MAB instances is via a large gap. There are MAB
algorithms with regret $O(\min(\frac{1}{\Delta} \log T, \sqrt{T \log T}))$ for a constant number of arms (Auer et al., 2002a).

The basic performance guarantee is expressed via Bayesian regret. Bounding the ex-post regret is
particularly useful if the prior is not exactly known to the planner, or if the prior that everyone believes in may
not be quite the same as the true prior. In general, ex-post regret also guards against “unlikely realizations”.
Besides, a bound on the ex-post regret is valid for every realization of the prior, which is reassuring for the
social planner, and it can also take advantage of “nice” realizations such as the ones with a large “gap”.
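To make the gap and the regret of a fixed play sequence concrete, here is a minimal sketch. The function names `gap` and `regret_of_sequence` are illustrative and not from the paper, and arms are 0-indexed here. For a deterministic sequence of pulls the inner expectation in the ex-post regret is trivial; a randomized algorithm would additionally require averaging over its runs.

```python
from typing import Sequence


def gap(mu: Sequence[float]) -> float:
    """Gap of an MAB instance: best expected reward minus the
    largest expected reward that is strictly smaller than the best."""
    best = max(mu)
    lower = [x for x in mu if x < best]
    if not lower:
        raise ValueError("gap is undefined when all arms have equal means")
    return best - max(lower)


def regret_of_sequence(mu: Sequence[float], pulls: Sequence[int]) -> float:
    """T * max_i mu_i minus the total expected reward of the arms pulled.
    `pulls` is a fixed sequence of 0-indexed arms (an assumption made here
    for illustration; it stands in for one realized run of an algorithm)."""
    horizon = len(pulls)
    return horizon * max(mu) - sum(mu[i] for i in pulls)
```

For instance, with means (0.5, 0.7, 0.2) the gap is the difference between 0.7 and 0.5, and every pull of a suboptimal arm adds its shortfall relative to 0.7 to the regret.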


Conditional expectations. Throughout, we often use conditional expectations of the form $\mathbb{E}[A \mid B = b]$
and $\mathbb{E}[A \mid B]$, where $A$ and $B$ are random variables. To avoid confusion, let us emphasize that $\mathbb{E}[A \mid B = b]$
evaluates to a scalar, whereas $\mathbb{E}[A \mid B]$ is a random variable that maps values of $B$ to the corresponding
conditional expectations of $A$. At a high level, we typically use $\mathbb{E}[A \mid B = b]$ in the algorithms’ specifications,
and we often consider $\mathbb{E}[A \mid B]$ in the analysis.


Concentration inequalities. For our detail-free algorithm, we will use a well-known concentration inequality
known as the Chernoff-Hoeffding Bound. This concentration inequality, in slightly different formulations,
can be found in many textbooks and surveys (e.g., Mitzenmacher and Upfal, 2005). We use a formulation
from the original paper (Hoeffding, 1963).


Theorem 3.3 (Chernoff-Hoeffding Bound). Consider $n$ I.I.D. random variables $X_1, \ldots, X_n$ with values in
$[0, 1]$. Let $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$ be their average, and let $\mu = \mathbb{E}[\bar{X}]$. Then:

\[ \Pr\left( |\bar{X} - \mu| < \delta \right) \ \ge\ 1 - 2 e^{-2 n \delta^2} \qquad \forall \delta \in (0, 1]. \tag{4} \]


In a typical usage, we consider the high-probability event in Equation (4) for suitably chosen random
variables $X_i$ (which we call the Chernoff event), use the above theorem to argue that the failure probability
is negligible, and proceed with the analysis conditional on the Chernoff event.
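As a small numeric illustration of how Equation (4) is typically used, the sketch below evaluates the failure probability bound $2e^{-2n\delta^2}$ and inverts it to get a sufficient number of samples. The function names are illustrative and not from the paper.

```python
import math


def hoeffding_failure_bound(n: int, delta: float) -> float:
    """Upper bound 2 * exp(-2 * n * delta^2) on Pr(|mean - mu| >= delta)
    for n I.I.D. samples in [0, 1], capped at 1."""
    if n <= 0 or not (0.0 < delta <= 1.0):
        raise ValueError("need n >= 1 and delta in (0, 1]")
    return min(1.0, 2.0 * math.exp(-2.0 * n * delta ** 2))


def samples_needed(delta: float, failure_prob: float) -> int:
    """Smallest n for which the bound above is at most failure_prob,
    obtained by solving 2 * exp(-2 * n * delta^2) <= failure_prob for n."""
    if not (0.0 < delta <= 1.0) or not (0.0 < failure_prob < 1.0):
        raise ValueError("need delta in (0, 1] and failure_prob in (0, 1)")
    return math.ceil(math.log(2.0 / failure_prob) / (2.0 * delta ** 2))
```

The required number of samples grows as $1/\delta^2$ but only logarithmically in the inverse failure probability, which is why the Chernoff event can be made to fail with negligible probability at modest cost.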


4 Basic technique: sampling the inferior arm 

A fundamental sub-problem in BIC bandit exploration is to incentivize agents to choose any arm $i \ge 2$
even once, as initially they would only choose arm 1. (Recall that arms are ordered according to their prior
mean rewards.) We provide a simple stand-alone BIC algorithm that samples each arm at least $k$ times, for
a given $k$, and completes in time $k$ times a prior-dependent constant. This algorithm is the initial stage of
the black-box reduction, and (in a detail-free extension) of the detail-free algorithm.

In this section we focus on the special case of two arms, so as to provide a lucid introduction to the
techniques and approaches in this paper. (The extension to many arms, which requires some additional ideas, is
postponed to Section 5.) We allow the common prior to be correlated across arms, under a mild assumption
which we prove is necessary. For intuition, we explicitly work out the special case of Gaussian priors.

Restricting the prior. While we consider the general case of correlated per-action priors, we need to restrict
the common prior $\mathcal{P}$ so as to give our algorithms a fighting chance, because our problem is hopeless for
some priors. As discussed in the Introduction, an easy example is a prior such that $\mu_1$ and $\mu_1 - \mu_2$ are
independent. Then samples from arm 1 have no bearing on the conditional expectation of $\mu_1 - \mu_2$, and
therefore cannot possibly incentivize an agent to try arm 2.

For a “fighting chance”, we assume that after seeing sufficiently many samples of arm 1 there is a positive
probability that arm 2 is better. To state this property formally, we denote with $S_1^k$ the random variable that
captures the first $k$ outcomes of arm 1, and we let $X^k = \mathbb{E}[\mu_2 - \mu_1 \mid S_1^k]$ be the conditional expectation of
$\mu_2 - \mu_1$ as a function of $S_1^k$. We make the following assumption:

(P1) $\Pr(X^k > 0) > 0$ for some prior-dependent constant $k = k_{\mathcal{P}} < \infty$.

In fact, Property (P1) is “almost necessary”: it is necessary for a strongly BIC algorithm.

Lemma 4.1. Consider an instance of BIC bandit exploration with two actions such that Property (P1) does
not hold. Then a strongly BIC algorithm never plays arm 2.

Proof. Since Property (P1) does not hold, by the Borel-Cantelli Lemma we have

\[ \Pr\left( X^t \le 0 \right) = 1 \qquad \forall t \in \mathbb{N}. \tag{5} \]

We prove by induction on $t$ that the $t$-th agent cannot be recommended arm 2. This is trivially true
for $t = 1$. Suppose the induction hypothesis is true for some $t$. Consider an execution of a strongly BIC
algorithm. Such an algorithm cannot recommend arm 2 before round $t + 1$. Then the decision whether to
recommend arm 2 in round $t + 1$ is determined by the $t$ outcomes of action 1. In other words, the event
$U = \{I_{t+1} = 2\}$ belongs to $\sigma(S_1^t)$, the sigma-algebra generated by $S_1^t$. Therefore,

\[ \mathbb{E}[\mu_2 - \mu_1 \mid U] = \mathbb{E}\left[ \mathbb{E}[\mu_2 - \mu_1 \mid S_1^t] \,\middle|\, U \right] = \mathbb{E}[X^t \mid U] \le 0. \]

The last inequality holds by (5). So a strongly BIC algorithm cannot recommend arm 2 in round $t + 1$. This
completes the induction proof. ■

Algorithm and analysis. We provide a simple BIC algorithm that samples both arms at least $k$ times. The
time is divided into $k + 1$ phases of $L$ rounds each, except the first one, where $L$ is a parameter that depends
on the prior. In the first phase the algorithm recommends arm 1 to $K = \max\{L, k\}$ agents, and picks
the “exploit arm” $a^*$ as the arm with the highest posterior mean conditional on these $K$ samples. In each
subsequent phase, it picks one agent independently and uniformly at random to explore arm 2. All other
agents exploit using arm $a^*$. A more formal description is given in Algorithm 1.

We prove that the algorithm is BIC as long as $L$ is larger than some prior-dependent constant. The idea is
that, due to information asymmetry, an agent recommended arm 2 does not know whether this is because of
exploration or exploitation, but knows that the latter is much more likely. For large enough $L$, the expected
gain from exploitation exceeds the expected loss from exploration, making the recommendation BIC.

Lemma 4.2. Consider BIC bandit exploration with two arms. Assume Property (P1) holds for some $k_{\mathcal{P}}$,
and let $Y = X^{k_{\mathcal{P}}}$. Algorithm 1 is BIC as long as

\[ L \ \ge\ 1 + \frac{\mu_1^0 - \mu_2^0}{\mathbb{E}[Y \mid Y > 0] \, \Pr(Y > 0)}. \tag{6} \]

The algorithm collects at least $k$ samples from each arm, and completes in $kL + \max(k, L)$ rounds.

ALGORITHM 1: A BIC algorithm to collect $k$ samples of both arms.

Parameters: $k, L \in \mathbb{N}$

In the first $K = \max\{L, k\}$ rounds recommend arm 1;
Let $s_1^K = (r_1^1, \ldots, r_1^K)$ be the corresp. tuple of rewards;
Let $a^* = \arg\max_{a \in \{1, 2\}} \mathbb{E}[\mu_a \mid s_1^K]$, breaking ties in favor of arm 1;
foreach phase $n = 2, \ldots, k + 1$ do
    From the set $P$ of the next $L$ agents, pick one agent $p_0$ uniformly at random;
    Every agent $p \in P - \{p_0\}$ is recommended arm $a^*$;
    Player $p_0$ is recommended arm 2
end


Note that the denominator in (6) is strictly positive by Property (P1). Indeed, the property implies that
$\Pr(Y > \tau) = \delta > 0$ for some $\tau > 0$, so $\mathbb{E}[Y \mid Y > 0] \, \Pr(Y > 0) = \mathbb{E}[Y \cdot \mathbf{1}\{Y > 0\}] \ge \tau \delta > 0$.
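The phase structure of Algorithm 1 can be sketched in code as follows. This is an illustrative simulation only, under assumptions that go beyond the general setting: the prior is independent across arms with Gaussian priors and Gaussian reward noise (so the posterior mean of arm 1 has a closed form and arm-1 samples carry no information about arm 2). All function and parameter names are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def gaussian_posterior_mean(prior_mean: float, prior_var: float,
                            noise_var: float, samples: List[float]) -> float:
    """Posterior mean of a Gaussian mean under a Gaussian prior."""
    k = len(samples)
    precision = 1.0 / prior_var + k / noise_var
    return (prior_mean / prior_var + sum(samples) / noise_var) / precision


def run_algorithm_1(k: int, L: int,
                    prior: Dict[int, Tuple[float, float]],
                    noise_var: Dict[int, float],
                    true_means: Dict[int, float],
                    rng: random.Random):
    """Simulate the recommendations of Algorithm 1 for arms 1 and 2.
    `prior[arm]` is (prior mean, prior variance); `true_means` is the
    realized reward vector, used only to generate rewards."""
    def pull(arm: int) -> float:
        return rng.gauss(true_means[arm], noise_var[arm] ** 0.5)

    recs: List[int] = []
    samples: Dict[int, List[float]] = {1: [], 2: []}

    # First phase: K = max(L, k) agents are all recommended arm 1.
    K = max(L, k)
    for _ in range(K):
        recs.append(1)
        samples[1].append(pull(1))

    # Exploit arm: best posterior mean given the K samples of arm 1;
    # ties go to arm 1. Arm 2 keeps its prior mean (independent prior).
    post1 = gaussian_posterior_mean(prior[1][0], prior[1][1],
                                    noise_var[1], samples[1])
    post2 = prior[2][0]
    exploit = 2 if post2 > post1 else 1

    # Phases 2..k+1: one uniformly random agent per phase explores arm 2.
    for _ in range(k):
        explorer = rng.randrange(L)
        for position in range(L):
            arm = 2 if position == explorer else exploit
            recs.append(arm)
            samples[arm].append(pull(arm))
    return recs, samples, exploit
```

An agent who sees a recommendation of arm 2 cannot tell whether she is the explorer or arm 2 is the exploit arm; the sketch makes visible that the explorer is only one of $L$ agents in a phase, which is what drives the $1/L$ factor in the analysis.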

Proof. The BIC constraint is trivially satisfied in the initial phase. By a simple observation from Kremer
et al. (2014), which also applies to our setting, it suffices to show that an agent is incentivized to follow a
recommendation for arm 2. Focus on one phase $n \ge 2$ of the algorithm, and fix some agent $p$ in this phase.
Let $Z = \mu_2 - \mu_1$. It suffices to prove that $\mathbb{E}[Z \mid I_p = 2] \Pr(I_p = 2) \ge 0$, i.e., that an agent recommended
arm 2 is not incentivized to switch to arm 1.

Let $S_1^K$ be the random variable which represents the tuple of the initial samples $s_1^K$. Observe that the
event $I_p = 2$ is uniquely determined by the initial samples $S_1^K$ and by the random bits of the algorithm,
which are independent of $Z$. Thus by the law of iterated expectations:

\[ \mathbb{E}[Z \mid I_p = 2] = \mathbb{E}\left[ \mathbb{E}[Z \mid S_1^K] \,\middle|\, I_p = 2 \right] = \mathbb{E}[X \mid I_p = 2], \]

where $X = X^K = \mathbb{E}[Z \mid S_1^K]$. There are two possible disjoint events under which agent $p$ is recommended
arm 2: either $\mathcal{E}_1 = \{X > 0\}$ or $\mathcal{E}_2 = \{X \le 0 \text{ and } p = p_0\}$. Thus,

\[ \mathbb{E}[Z \mid I_p = 2] \Pr(I_p = 2) = \mathbb{E}[X \mid I_p = 2] \Pr(I_p = 2) = \mathbb{E}[X \mid \mathcal{E}_1] \Pr(\mathcal{E}_1) + \mathbb{E}[X \mid \mathcal{E}_2] \Pr(\mathcal{E}_2). \]

Observe that:

\[ \Pr(\mathcal{E}_2) = \Pr(X \le 0) \cdot \Pr(p = p_0 \mid X \le 0) = \tfrac{1}{L} \Pr(X \le 0). \]

Moreover, $X$ is independent of the event $\{p = p_0\}$. Therefore, we get:

\[ \mathbb{E}[Z \mid I_p = 2] \Pr(I_p = 2) = \mathbb{E}[X \mid X > 0] \Pr(X > 0) + \tfrac{1}{L} \, \mathbb{E}[X \mid X \le 0] \Pr(X \le 0). \]

Thus, for $\mathbb{E}[Z \mid I_p = 2] \ge 0$ it suffices to pick $L$ such that:

\[ L \ \ge\ \frac{-\mathbb{E}[X \mid X \le 0] \Pr(X \le 0)}{\mathbb{E}[X \mid X > 0] \Pr(X > 0)} = 1 - \frac{\mathbb{E}[X]}{\mathbb{E}[X \mid X > 0] \Pr(X > 0)}. \]

The Lemma follows because $\mathbb{E}[X] = \mathbb{E}[Z] = \mu_2^0 - \mu_1^0$ and

\[ \mathbb{E}[X \mid X > 0] \Pr(X > 0) \ \ge\ \mathbb{E}[Y \mid Y > 0] \Pr(Y > 0). \]

The latter inequality follows from Lemma 4.3, as stated and proved below. Essentially, it holds because
$X = X^K$ conditions on more information than $Y = X^{k_{\mathcal{P}}}$: respectively, the first $K$ samples and the first $k_{\mathcal{P}}$
samples of arm 1, where $K \ge k_{\mathcal{P}}$. ■


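Lemma 4.3. Let $S$ and $T$ be two random variables such that $T$ contains more information than $S$, i.e.,
if $\sigma(S)$ and $\sigma(T)$ are the sigma-algebras generated by these random variables then $\sigma(S) \subset \sigma(T)$. Let
$X^S = \mathbb{E}[f(\mu) \mid S]$ and $X^T = \mathbb{E}[f(\mu) \mid T]$ for some function $f(\cdot)$ on the vector of means $\mu$. Then:

\[ \mathbb{E}[X^T \mid X^T > 0] \Pr(X^T > 0) \ \ge\ \mathbb{E}[X^S \mid X^S > 0] \Pr(X^S > 0). \tag{7} \]

Proof. Since the event $\{X^S > 0\}$ is measurable in $\sigma(S)$ we have:

\[ \mathbb{E}[X^S \mid X^S > 0] = \mathbb{E}\left[ \mathbb{E}[f(\mu) \mid S] \,\middle|\, X^S > 0 \right] = \mathbb{E}[f(\mu) \mid X^S > 0]. \]

Since $\sigma(S) \subset \sigma(T)$:

\[ \mathbb{E}[f(\mu) \mid X^S > 0] = \mathbb{E}\left[ \mathbb{E}[f(\mu) \mid T] \,\middle|\, X^S > 0 \right] = \mathbb{E}[X^T \mid X^S > 0]. \]

Thus we can conclude:

\[ \mathbb{E}[X^S \mid X^S > 0] \Pr(X^S > 0) = \mathbb{E}[X^T \mid X^S > 0] \Pr(X^S > 0) = \int_{\{X^S > 0\}} X^T \ \le\ \int_{\{X^T > 0\}} X^T = \mathbb{E}[X^T \mid X^T > 0] \Pr(X^T > 0). \ ■ \]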


Gaussian priors. To give more intuition on the constants $L$, $k_{\mathcal{P}}$ and the random variable $Y$ from Lemma 4.2, we
present a concrete example where the common prior is independent across arms, the prior on the expected
reward of each arm is a Normal distribution, and the rewards are normally distributed conditional on their
mean. We will use $N(\mu, \sigma^2)$ to denote a normal distribution with mean $\mu$ and variance $\sigma^2$.

Example 4.4. Assume the prior $\mathcal{P}$ is independent across arms, $\mu_i \sim N(\mu_i^0, \sigma_i^2)$ for each arm $i$, and the
respective reward $r_i^t$ is conditionally distributed as $r_i^t \mid \mu_i \sim N(\mu_i, \rho_i^2)$. Then for each $k \in \mathbb{N}$

\[ X^k \ \sim\ N\left( \mu_2^0 - \mu_1^0,\ \sigma_1^2 \cdot \frac{k \sigma_1^2}{\rho_1^2 + k \sigma_1^2} \right). \tag{8} \]

It follows that Property (P1) is satisfied for any $k = k_{\mathcal{P}} \ge 1$.


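Proof. Observe that $r_1^t = \mu_1 + \epsilon_1^t$, where $\epsilon_1^t \sim N(0, \rho_1^2)$, and $\mu_1 = \mu_1^0 + \epsilon_1$, with $\epsilon_1 \sim N(0, \sigma_1^2)$. Therefore,

\[ \mathbb{E}[\mu_1 \mid S_1^k] = \frac{\frac{1}{\sigma_1^2} \mu_1^0 + \frac{1}{\rho_1^2} \sum_{t=1}^{k} r_1^t}{\frac{1}{\sigma_1^2} + \frac{k}{\rho_1^2}} = \mu_1^0 + \frac{\frac{k}{\rho_1^2}}{\frac{1}{\sigma_1^2} + \frac{k}{\rho_1^2}} \left( \epsilon_1 + \frac{1}{k} \sum_{t=1}^{k} \epsilon_1^t \right) = \mu_1^0 + \frac{k \sigma_1^2}{\rho_1^2 + k \sigma_1^2} \left( \epsilon_1 + \frac{1}{k} \sum_{t=1}^{k} \epsilon_1^t \right). \]

Since $\epsilon_1 + \frac{1}{k} \sum_{t=1}^{k} \epsilon_1^t \sim N(0, \sigma_1^2 + \rho_1^2 / k)$, it follows that $\mathbb{E}[\mu_1 \mid S_1^k]$ is a linear transformation of a Normal
distribution and therefore

\[ \mathbb{E}[\mu_1 \mid S_1^k] \ \sim\ N\left( \mu_1^0,\ \left( \frac{k \sigma_1^2}{\rho_1^2 + k \sigma_1^2} \right)^{2} \left( \sigma_1^2 + \frac{\rho_1^2}{k} \right) \right) = N\left( \mu_1^0,\ \sigma_1^2 \cdot \frac{k \sigma_1^2}{\rho_1^2 + k \sigma_1^2} \right). \]

Equation (8) follows because $X^k = \mathbb{E}[\mu_2 - \mu_1 \mid S_1^k] = \mu_2^0 - \mathbb{E}[\mu_1 \mid S_1^k]$. ■

Observe that as $k \to \infty$, the variance of $X^k$ increases and converges to the variance of $\mu_1$. Intuitively,
more samples can move the posterior mean further away from the prior mean $\mu_1^0$; in the limit of many
samples $X^k$ will be distributed exactly the same as $\mu_2^0 - \mu_1$.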

While any value of the parameter $k_{\mathcal{P}}$ is sufficient for BIC, one could choose $k_{\mathcal{P}}$ to minimize the phase
length $L$. Recall from Lemma 4.2 that the (smallest possible) value of $L$ is inversely proportional to the
quantity $\mathbb{E}[Y \mid Y > 0] \Pr(Y > 0)$. This is simply an integral over the positive region of a normally
distributed random variable with negative mean, as given in Equation (8). Thus, increasing $k_{\mathcal{P}}$ increases the
variance of this random variable, and therefore increases the integral. For example, $k_{\mathcal{P}} = \rho_1^2 / \sigma_1^2$ is a
reasonable choice if $\rho_1^2 \gg \sigma_1^2$.
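The trade-off between $k_{\mathcal{P}}$ and $L$ in the Gaussian example can be checked numerically. The sketch below assumes the variance formula of Equation (8) and a bound of the form $L \ge 1 + (\mu_1^0 - \mu_2^0) / (\mathbb{E}[Y \mid Y > 0] \Pr(Y > 0))$ from Lemma 4.2; it uses the standard identity $\mathbb{E}[Y \cdot \mathbf{1}\{Y > 0\}] = m\,\Phi(m/s) + s\,\varphi(m/s)$ for $Y \sim N(m, s^2)$. Function names are illustrative.

```python
import math


def posterior_mean_variance(sigma1: float, rho1: float, k: int) -> float:
    """Variance of X^k in the Gaussian example:
    sigma1^2 * (k * sigma1^2) / (rho1^2 + k * sigma1^2)."""
    return sigma1 ** 2 * (k * sigma1 ** 2) / (rho1 ** 2 + k * sigma1 ** 2)


def positive_part_mean(m: float, s: float) -> float:
    """E[Y * 1{Y > 0}] for Y ~ N(m, s^2), which equals
    E[Y | Y > 0] * Pr(Y > 0)."""
    z = m / s
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return m * cdf + s * pdf


def min_phase_length(mu1_prior: float, mu2_prior: float,
                     sigma1: float, rho1: float, k: int) -> int:
    """Smallest integer L satisfying the assumed bound
    L >= 1 + (mu1_prior - mu2_prior) / E[Y * 1{Y > 0}],
    with Y = X^k distributed as N(mu2_prior - mu1_prior, Var[X^k])."""
    std = math.sqrt(posterior_mean_variance(sigma1, rho1, k))
    denom = positive_part_mean(mu2_prior - mu1_prior, std)
    return math.ceil(1.0 + (mu1_prior - mu2_prior) / denom)
```

Because the variance of $X^k$ is increasing in $k$ and the integral over the positive region grows with the standard deviation, the required phase length is non-increasing in $k$, with diminishing returns once $k \sigma_1^2$ dominates $\rho_1^2$.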

5 Black-box reduction from any algorithm 

In this section we present a “black-box reduction” that uses a given bandit algorithm $\mathcal{A}$ as a black box to
create a BIC algorithm for BIC bandit exploration, with only a minor loss in performance.

The reduction works as follows. There are two parameters: $k$, $L$. We start with a “sampling stage” which
collects $k$ samples of each arm in only a constant number of rounds. (The sampling stage for two arms is
implemented in Section 4.) Then we proceed with a “simulation stage”, divided into phases of $L$ rounds each.
In each phase, we pick one round uniformly at random and dedicate it to $\mathcal{A}$. We run $\mathcal{A}$ in the “dedicated”
rounds only, ignoring the feedback from all other rounds. In the non-dedicated rounds, we recommend an
“exploit arm”: an arm with the largest posterior mean reward conditional on the samples collected in the
previous phases. A formal description is given in Algorithm 2.


ALGORITHM 2: Black-box reduction: the simulation stage.

Input: A bandit algorithm $\mathcal{A}$; parameters $k, L \in \mathbb{N}$.
Input: A dataset $S^1$ which contains $k$ samples from each arm.

Split all rounds into consecutive phases of $L$ rounds each;
foreach phase $n = 1, \ldots$ do
    Let $a_n^* = \arg\max_{i \in A} \mathbb{E}[\mu_i \mid S^n]$ be the “exploit arm”;
    Query algorithm $\mathcal{A}$ for its next arm selection $i_n$;
    Pick one agent $p$ from the $L$ agents in the phase uniformly at random;
    Every agent in the phase is recommended $a_n^*$, except agent $p$ who is recommended $i_n$;
    Return to algorithm $\mathcal{A}$ the reward of agent $p$;
    Set $S^{n+1} = S^n \cup \{\text{all samples collected in this phase}\}$
end
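The simulation stage can be sketched as a wrapper around any bandit algorithm. The interface below (`select_arm`, `update`) is a hypothetical one chosen for illustration, not an API from the paper, and the posterior mean is computed under the additional assumption of independent Gaussian priors with Gaussian noise. `RoundRobin` is a placeholder standing in for the black-box algorithm.

```python
import random
from typing import Callable, Dict, List, Protocol, Tuple


class BanditAlgorithm(Protocol):
    """Hypothetical interface for the black-box algorithm."""
    def select_arm(self) -> int: ...
    def update(self, arm: int, reward: float) -> None: ...


class RoundRobin:
    """Placeholder bandit algorithm: cycles through the arms."""
    def __init__(self, arms: List[int]) -> None:
        self.arms, self.t = list(arms), 0
        self.history: List[Tuple[int, float]] = []

    def select_arm(self) -> int:
        arm = self.arms[self.t % len(self.arms)]
        self.t += 1
        return arm

    def update(self, arm: int, reward: float) -> None:
        self.history.append((arm, reward))


def posterior_mean(prior_mean: float, prior_var: float,
                   noise_var: float, samples: List[float]) -> float:
    """Gaussian posterior mean (assumes an independent Gaussian prior)."""
    precision = 1.0 / prior_var + len(samples) / noise_var
    return (prior_mean / prior_var + sum(samples) / noise_var) / precision


def simulation_stage(algo: BanditAlgorithm, phases: int, L: int,
                     prior: Dict[int, Tuple[float, float]],
                     noise_var: Dict[int, float],
                     dataset: Dict[int, List[float]],
                     pull: Callable[[int], float],
                     rng: random.Random):
    """Run `phases` phases of L rounds; return recommendations and data."""
    data = {arm: list(xs) for arm, xs in dataset.items()}
    recs: List[int] = []
    for _ in range(phases):
        # Exploit arm: best posterior mean given data from earlier phases
        # (ties go to the lowest-numbered arm).
        exploit = max(sorted(data), key=lambda a: posterior_mean(
            prior[a][0], prior[a][1], noise_var[a], data[a]))
        explore = algo.select_arm()
        dedicated = rng.randrange(L)  # the one round given to the algorithm
        phase_samples: List[Tuple[int, float]] = []
        for position in range(L):
            arm = explore if position == dedicated else exploit
            reward = pull(arm)
            recs.append(arm)
            phase_samples.append((arm, reward))
            if position == dedicated:
                algo.update(arm, reward)  # only this feedback reaches algo
        for arm, reward in phase_samples:
            data[arm].append(reward)  # visible to the exploit arm next phase
    return recs, data
```

The design point the sketch highlights is the separation of information: the black-box algorithm sees one reward per phase, while the exploit arm is recomputed from all samples gathered in previous phases.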


The sampling stage for $m > 2$ arms is a non-trivial extension of the two-arms case. In order to convince
an agent to pull an arm $i > 1$, we need to convince him not to switch to any other arm $j < i$. We implement
a “sequential contest” among the arms. We start with collecting samples of arm 1. The rest of the process
is divided into $m - 1$ phases $2, 3, \ldots, m$, where the goal of each phase $i$ is to collect the samples of arm
$i$. We maintain the “exploit arm” $a^*$, the arm with the best posterior mean given the previously collected
samples of arms $j < i$. The $i$-th phase collects the samples of arm $i$ using a procedure similar to that in
Section 4: $k$ agents chosen u.a.r. are recommended arm $i$, and the rest are recommended the exploit arm.
Then the current exploit arm is compared against arm $i$, and the “winner” of this contest becomes the new
exploit arm. The pseudocode is in Algorithm 3, following the description above.

ALGORITHM 3: Black-box reduction: the sampling stage.

Parameters: $k, L \in \mathbb{N}$; the number of arms $m$.

For the first $k$ agents recommend arm 1, let $s_1^k = (r_1^1, \ldots, r_1^k)$ be the tuple of rewards;
foreach arm $i > 1$ in increasing lexicographic order do
    Let $a^* = \arg\max_{a \in A} \mathbb{E}[\mu_a \mid s_1^k, \ldots, s_{i-1}^k]$, breaking ties lexicographically;
    From the set $P$ of the next $L \cdot k$ agents, pick a set $Q$ of $k$ agents uniformly at random;
    Every agent $p \in P - Q$ is recommended arm $a^*$;
    Every agent $p \in Q$ is recommended arm $i$;
    Let $s_i^k$ be the tuple of rewards of arm $i$ returned by agents in $Q$
end


The rest of this section is organized as follows. First we state the incentive-compatibility guarantees
(Section 5.1), then we prove them for the simulation stage (Section 5.2) and the sampling stage (Section 5.3).
Then we characterize the performance of the black-box reduction, in terms of average rewards, total rewards,
and Bayesian regret (Section 5.4). We also consider a version in which $\mathcal{A}$ outputs a prediction in each round
(Section 5.5). The technically difficult part is to prove BIC, and trace out assumptions on the prior which
make it possible.

5.1 Incentive-compatibility guarantees 

As in the previous section, we need an assumption on the prior $\mathcal{P}$ to guarantee incentive-compatibility.

(P2) Let $S_i^k$ be the random variable representing $k$ rewards of action $i$, and let

\[ X_i^k = \min_{j \ne i} \mathbb{E}\left[ \mu_i - \mu_j \mid S_1^k, \ldots, S_{i-1}^k \right]. \tag{9} \]

There exist prior-dependent constants $k_{\mathcal{P}}, \tau_{\mathcal{P}}, \rho_{\mathcal{P}} > 0$ such that

\[ \Pr\left( X_i^k > \tau_{\mathcal{P}} \right) \ge \rho_{\mathcal{P}} \qquad \forall k > k_{\mathcal{P}},\ i \in A. \]

Informally: any given arm $i$ can “a posteriori” be the best arm by margin $\tau_{\mathcal{P}}$ with probability at least $\rho_{\mathcal{P}}$ after
seeing sufficiently many samples of each arm $j < i$. We deliberately state this property in such an abstract
manner in order to make our BIC guarantees as inclusive as possible, e.g., allow for correlated priors, and
avoid diluting our main message by technicalities related to convergence properties of Bayesian estimators.

For the special case of two arms, Property (P2) follows from Property (P1), see Section ??. Recall that
the latter property is necessary for any strongly BIC algorithm.

Theorem 5.1 (BIC). Assume Property (P2) holds with constants $k_{\mathcal{P}}, \tau_{\mathcal{P}}, \rho_{\mathcal{P}}$. Then the black-box reduction
is BIC, for any bandit algorithm $\mathcal{A}$, if the parameters $k, L$ are larger than some prior-dependent constant.
More precisely, it suffices to take $k \ge k_{\mathcal{P}}$ and $L \ge 2 + \frac{\mu_{\max}^0 - \mu_m^0}{\tau_{\mathcal{P}} \cdot \rho_{\mathcal{P}}}$, where $\mu_{\max}^0 = \mathbb{E}[\max_{i \in A} \mu_i]$.

If the common prior $\mathcal{P}$ puts positive probability on the event that $\{\mu_i - \max_{j \ne i} \mu_j \ge \tau\}$, then
Property (P2) is plausible because the conditional expectations tend to approach the true means as the number of
samples grows. We elucidate this intuition by giving a concrete, and yet fairly general, example, where we
focus on priors that are independent across arms. (The proof can be found in Section ??.)

Lemma 5.2. Property (P2) holds for a constant number of arms under the following assumptions:

(i) The prior $\mathcal{P}$ is independent across arms.

(ii) The prior on each $\mu_i$ has full support on some bounded interval $[a, b]$.$^8$

(iii) The prior mean rewards are all distinct: $\mu_1^0 > \mu_2^0 > \ldots > \mu_m^0$.

(iv) The realized rewards are bounded (perhaps on a larger interval).

5.2 Proof of Theorem I5.lt the simulation stage 

Instead using Property (Fj2]l directly, we will use a somewhat weaker corollary thereof (derived Section ISTTI) : 

(P3) Let Afc be a random variable representing ki > k samples from each action i. There exist prior- 
dependent constants klf < oo and Tp, pp > 0 such that 

\/k'>kp Vi € A Pr [ min E [pi — pfh-k] > Tp] > pp. 

\j€A-{i} ) 

Let us consider phase $n$ of the algorithm. We will argue that any agent $p$ who is recommended some arm $j$ does not want to switch to some other arm $i$. More formally, we assert that

$$E[\mu_j - \mu_i \mid I_p = j] \Pr(I_p = j) \geq 0.$$

Let $S^n$ be the random variable which represents the dataset collected by the algorithm by the beginning of phase $n$. Let $X_{ji}^n = E[\mu_j - \mu_i \mid S^n]$ and $X_j^n = \min_{i \in A/\{j\}} X_{ji}^n$. It suffices to show:

$$E[X_j^n \mid I_p = j] \Pr(I_p = j) \geq 0.$$

There are two possible ways that agent $p$ is recommended arm $j$. Either arm $j$ is the best posterior arm, i.e., $X_j^n > 0$, and agent $p$ is not the unlucky agent; or $X_j^n < 0$, in which case arm $j$ cannot possibly be the exploit arm $a^*$, and therefore $I_p = j$ happens only if agent $p$ was the one unlucky agent among the $L$ agents of the phase to be recommended the arm of algorithm $A$, and algorithm $A$ recommended arm $j$. There are also the following two cases: (i) $X_j^n = 0$, or (ii) $X_j^n > 0$, agent $p$ is the unlucky agent, and algorithm $A$ recommends arm $j$. However, since we only want to lower-bound the expectation and since $X_j^n \geq 0$ under both these events, we will disregard these events, which can only contribute positively.

Denote with $U$ the index of the “unlucky” agent selected in phase $n$ to be given the recommendation of the original algorithm $A$. Denote with $A_j^n$ the event that algorithm $A$ pulls arm $j$ at iteration $n$. Thus by our reasoning in the previous paragraph:

$$E[X_j^n \mid I_p = j] \Pr(I_p = j) \geq E[X_j^n \mid X_j^n > 0, U \neq p] \Pr(X_j^n > 0, U \neq p) + E[X_j^n \mid X_j^n < 0, U = p, A_j^n] \Pr(X_j^n < 0, U = p, A_j^n). \quad (10)$$

Since $X_j^n$ is independent of $U$, we can drop conditioning on the event $\{U = p\}$ from the conditional expectations in Equation (10). Further, observe that:

$$\Pr(X_j^n < 0, U = p, A_j^n) = \Pr(U = p \mid X_j^n < 0, A_j^n) \Pr(X_j^n < 0, A_j^n) = \tfrac{1}{L} \Pr(X_j^n < 0, A_j^n),$$

$$\Pr(X_j^n > 0, U \neq p) = \Pr(U \neq p \mid X_j^n > 0) \Pr(X_j^n > 0) = \left(1 - \tfrac{1}{L}\right) \Pr(X_j^n > 0).$$

Plugging this into Equation (10), we can re-write it as follows:

$$E[X_j^n \mid I_p = j] \Pr(I_p = j) \geq \left(1 - \tfrac{1}{L}\right) E[X_j^n \mid X_j^n > 0] \Pr(X_j^n > 0) + \tfrac{1}{L} E[X_j^n \mid X_j^n < 0, A_j^n] \Pr(X_j^n < 0, A_j^n). \quad (11)$$

(Footnote to assumption (ii) of Lemma 5.2: that is, the prior on $\mu_i$ assigns a strictly positive probability to every open subinterval of $[a, b]$, for each arm $i$.)

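Observe that conditional on $\{X_j^n < 0\}$, the integrand of the second part in the latter summation is always non-positive. Thus integrating over a bigger set of events can only decrease it, i.e.:

$$E[X_j^n \mid X_j^n < 0, A_j^n] \Pr(X_j^n < 0, A_j^n) \geq E[X_j^n \mid X_j^n < 0] \Pr(X_j^n < 0).$$

Plugging this into Equation (11), we can lower bound the expected gain as:

$$\begin{aligned} E[X_j^n \mid I_p = j] \Pr(I_p = j) &\geq \left(1 - \tfrac{1}{L}\right) E[X_j^n \mid X_j^n > 0] \Pr(X_j^n > 0) + \tfrac{1}{L} E[X_j^n \mid X_j^n < 0] \Pr(X_j^n < 0) \\ &= \left(1 - \tfrac{2}{L}\right) E[X_j^n \mid X_j^n \geq 0] \Pr(X_j^n \geq 0) + \tfrac{1}{L} E[X_j^n]. \end{aligned}$$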


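Now observe that:

$$E[X_j^n] = E\left[\min_{i \in A/\{j\}} E[\mu_j - \mu_i \mid S^n]\right] \geq E\left[E\left[\min_{i \in A/\{j\}} (\mu_j - \mu_i) \,\Big|\, S^n\right]\right] = E\left[\mu_j - \max_{i \in A/\{j\}} \mu_i\right] \geq \mu_j^0 - \mu_{\max}^0.$$

The first inequality holds because a minimum of expectations is at least the expectation of the minimum.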


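By Property (P3), we have that $\Pr(X_j^n \geq \tau_\mathcal{P}) \geq \rho_\mathcal{P}$. Therefore:

$$E[X_j^n \mid X_j^n \geq 0] \Pr(X_j^n \geq 0) \geq \tau_\mathcal{P} \cdot \rho_\mathcal{P} > 0.$$

Thus the expected gain is lower bounded by:

$$E[X_j^n \mid I_p = j] \Pr(I_p = j) \geq \tau_\mathcal{P} \cdot \rho_\mathcal{P} \left(1 - \tfrac{2}{L}\right) + \tfrac{1}{L} \left(\mu_j^0 - \mu_{\max}^0\right).$$

The latter is non-negative if:

$$L \geq \frac{\mu_{\max}^0 - \mu_j^0}{\tau_\mathcal{P} \cdot \rho_\mathcal{P}} + 2.$$

Since this must hold for every arm $j$, we get that for incentive compatibility it suffices to pick:

$$L \geq \frac{\max_{j \in A} \left(\mu_{\max}^0 - \mu_j^0\right)}{\tau_\mathcal{P} \cdot \rho_\mathcal{P}} + 2 = \frac{\mu_{\max}^0 - \mu_m^0}{\tau_\mathcal{P} \cdot \rho_\mathcal{P}} + 2.$$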
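As a quick numeric illustration of the bound just derived, the following Python sketch computes the smallest integer phase length $L$ satisfying $L \geq (\mu_{\max}^0 - \mu_m^0)/(\tau_\mathcal{P} \rho_\mathcal{P}) + 2$. The function name and the sample numbers are hypothetical and chosen only for illustration; in practice the constants $\tau_\mathcal{P}, \rho_\mathcal{P}$ depend on the prior and are not computed here.

```python
import math


def min_phase_length_simulation(tau_p: float, rho_p: float,
                                mu0_max: float, mu0_worst: float) -> int:
    """Smallest integer L with L >= (mu0_max - mu0_worst) / (tau_p * rho_p) + 2.

    tau_p, rho_p : the prior-dependent constants from Property (P3)
    mu0_max      : the quantity written mu^0_max in the text
    mu0_worst    : prior mean reward of the last-ranked arm, mu^0_m
    """
    if tau_p <= 0 or rho_p <= 0:
        raise ValueError("tau_p and rho_p must be positive")
    return math.ceil((mu0_max - mu0_worst) / (tau_p * rho_p) + 2)


# Hypothetical inputs, for illustration only.
example_L = min_phase_length_simulation(tau_p=0.25, rho_p=0.5,
                                        mu0_max=0.75, mu0_worst=0.25)
```

The bound grows as the product $\tau_\mathcal{P} \rho_\mathcal{P}$ shrinks, so priors under which an arm is rarely a clear posterior winner require longer phases.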

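5.3 Proof of Theorem 5.1: the sampling stage

Compared to Theorem 5.1, a somewhat weaker bound on parameter $L$ suffices: $L \geq 1 + \frac{\mu_1^0 - \mu_m^0}{\tau_\mathcal{P} \cdot \rho_\mathcal{P}}$.

Denote $X_{ij} = E[\mu_i - \mu_j \mid S_1^k, \ldots, S_{i-1}^k]$ and $X_i = \min_{j \neq i} X_{ij}$.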

The algorithm can be split into $m$ phases, where each phase except the first one lasts $L \cdot k$ rounds, and agents are aware which phase they belong to. We will argue that each agent $p$ in each phase $i$ has no incentive not to follow the recommended arm. If an agent in phase $i$ is recommended any arm $j \neq i$, then she knows that this is because this is the exploit arm $a^*$; so by definition of the exploit arm it is incentive-compatible for the agent to follow it. Thus it remains to argue that an agent $p$ in phase $i$ who is recommended arm $i$ does not want to deviate to some other arm $j \neq i$, i.e., that $E[\mu_i - \mu_j \mid I_p = i] \Pr(I_p = i) \geq 0$.

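By the law of iterated expectations:

$$E[\mu_i - \mu_j \mid I_p = i] \Pr(I_p = i) = E\big[E[\mu_i - \mu_j \mid S_1^k, \ldots, S_{i-1}^k] \,\big|\, I_p = i\big] \Pr(I_p = i) = E[X_{ij} \mid I_p = i] \Pr(I_p = i).$$

There are two possible disjoint events under which agent $p$ is recommended arm $i$: either $\mathcal{E}_1 = \{X_i > 0\}$ or $\mathcal{E}_2 = \{X_i \leq 0 \text{ and } p \in Q\}$. Thus:

$$E[\mu_i - \mu_j \mid I_p = i] \Pr(I_p = i) = E[X_{ij} \mid I_p = i] \Pr(I_p = i) = E[X_{ij} \mid \mathcal{E}_1] \Pr(\mathcal{E}_1) + E[X_{ij} \mid \mathcal{E}_2] \Pr(\mathcal{E}_2).$$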


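Observe that:

$$\Pr(\mathcal{E}_2) = \Pr(X_i \leq 0) \cdot \Pr(p \in Q \mid X_i \leq 0) = \tfrac{1}{L} \Pr(X_i \leq 0).$$

Moreover, $X_{ij}$ is independent of the event $\{p \in Q\}$. Therefore, we get:

$$\begin{aligned} E[\mu_i - \mu_j \mid I_p = i] \Pr(I_p = i) &= E[X_{ij} \mid X_i > 0] \Pr(X_i > 0) + \tfrac{1}{L} E[X_{ij} \mid X_i \leq 0] \Pr(X_i \leq 0) \\ &= E[X_{ij} \mid X_i > 0] \Pr(X_i > 0) \left(1 - \tfrac{1}{L}\right) + \tfrac{1}{L} E[X_{ij}] \\ &= E[X_{ij} \mid X_i > 0] \Pr(X_i > 0) \left(1 - \tfrac{1}{L}\right) + \tfrac{1}{L} \left(\mu_i^0 - \mu_j^0\right) \\ &\geq E[X_i \mid X_i > 0] \Pr(X_i > 0) \left(1 - \tfrac{1}{L}\right) + \tfrac{1}{L} \left(\mu_i^0 - \mu_j^0\right). \end{aligned}$$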

In the above, we used the facts that $E[X_{ij}] = E[\mu_i - \mu_j] = \mu_i^0 - \mu_j^0$ and that $X_{ij} \geq X_i$. Thus it suffices to pick $L$ such that:

$$L \geq 1 + \frac{\mu_j^0 - \mu_i^0}{E[X_i \mid X_i > 0] \Pr(X_i > 0)}.$$

The lemma follows by our choice of $L$ and since $\mu_j^0 - \mu_i^0 \leq \mu_1^0 - \mu_m^0$ for all $j \in A$, and by Property (P2) we have that $E[X_i \mid X_i > 0] \Pr(X_i > 0) \geq \tau_\mathcal{P} \cdot \rho_\mathcal{P}$.

5.4 Performance guarantees 

The algorithm's performance can be quantified in several ways: in terms of the total expected reward, the average expected reward, and in terms of Bayesian regret; we state guarantees for all three. Unlike many results on regret minimization, we do not assume that expected rewards are bounded.

Theorem 5.3 (Performance). Consider the black-box reduction with parameters $k, L$, applied to bandit algorithm $A$. Let $c = k + Lk$ be the number of rounds in the sampling stage. Then:

(a) Let $\widehat{U}_A(\tau_1, \tau_2)$ be the average Bayesian-expected reward of algorithm $A$ in rounds $[\tau_1, \tau_2]$. Then

$$\widehat{U}_{A^{IC}}(c + 1, c + \tau) \geq \widehat{U}_A(1, \lfloor \tau / L \rfloor) \quad \text{for any duration } \tau.$$

(b) Let $U_A(t)$ be the total Bayesian-expected reward of algorithm $A$ in the first $\lfloor t \rfloor$ rounds. Then

$$U_{A^{IC}}(t) \geq L \cdot U_A\left(\frac{t - c}{L}\right) + c \cdot E\left[\min_{i \in A} \mu_i\right].$$

(c) Let $R_A(t)$ be the Bayesian regret of algorithm $A$ in the first $\lfloor t \rfloor$ rounds. Then

$$R_{A^{IC}}(t) \leq L \cdot R_A(t / L) + c \cdot E\left[\max_{i \in A} \mu_i - \min_{i \in A} \mu_i\right].$$

In particular, if $\mu_i \in [0, 1]$ for all arms $i$, and the original algorithm $A$ achieves asymptotically optimal Bayesian regret $O(\sqrt{t})$, then so does $A^{IC}$.


Proof. Recall that $a_n^*$ and $i_n$ are, resp., the “exploit” arm and the original algorithm’s selection in phase $n$.

Part (a) follows because $E\left[\sum_{n=1}^{N} \mu_{i_n}\right] = U_A(N)$, where $N$ is the number of phases, and by definition of the “exploit” arm we have $E[\mu_{a_n^*} \mid S^n] \geq E[\mu_{i_n} \mid S^n]$ for each phase $n$.

Parts (b) and (c) follow easily from part (a) by going, resp., from average rewards to total rewards, and from total rewards to regret. We have no guarantee whatsoever on the actions chosen in the first $c$ rounds, so we assume the worst: in part (b) we use the worst possible Bayesian-expected reward $E[\min_{i \in A} \mu_i]$, and in part (c) we use the worst possible per-round Bayesian regret $E[\max_{i \in A} \mu_i - \min_{i \in A} \mu_i]$. ■
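To make part (c) concrete, here is a small Python sketch that evaluates the right-hand side $L \cdot R_A(t/L) + c \cdot E[\max_i \mu_i - \min_i \mu_i]$ for a user-supplied regret function. The function names and the example regret curve $R_A(t) = \sqrt{t}$ are illustrative assumptions, not part of the reduction itself.

```python
import math
from typing import Callable


def ic_regret_bound(regret_a: Callable[[float], float], t: float,
                    L: int, c: int, worst_gap: float) -> float:
    """Right-hand side of Theorem 5.3(c).

    regret_a  : Bayesian regret of the original algorithm A as a function of time
    t         : number of rounds
    L, c      : phase length and number of sampling-stage rounds
    worst_gap : E[max_i mu_i - min_i mu_i], supplied by the caller
    """
    return L * regret_a(t / L) + c * worst_gap


def sqrt_regret(t: float) -> float:
    """Hypothetical regret curve R_A(t) = sqrt(t), used only as an example."""
    return math.sqrt(t)
```

With $R_A(t) = \sqrt{t}$ the bound becomes $\sqrt{L t} + c \cdot \text{gap}$, which shows the constant-factor blow-up in the leading term plus an additive cost for the sampling stage.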


5.5 The black-box reduction with predictions 

We can also allow the original algorithm A to output a prediction in each round. For our purposes, the 
prediction is completely abstract: it can be any element in the given set of possible predictions. It can be a 
predicted best arm, the estimated gap between the best and the second-best arms (in terms of their expected 
rewards), or any other information that the planner wishes to learn. For example, as pointed out in the 
Introduction, the planner may wish to learn the worst action (or several worst actions, and possibly along 
with their respective rewards), in order to eliminate these actions via regulation and legislation. 

The black-box reduction can be easily modified to output predictions, visible to the planner but not to agents. In each phase $n$, the reduction records the prediction made by $A$, denote it by $\phi_n$. In each round of the initial sampling stage, and in each round of the first phase ($n = 1$), the reduction outputs the “null” prediction. In each round of every subsequent phase $n \geq 2$, the reduction outputs $\phi_{n-1}$.

Predictions are not visible to agents and therefore have no bearing on their incentives; therefore, the resulting algorithm is BIC by Theorem 5.1.

We are interested in the sample complexity of $A^{IC}$: essentially, how the prediction quality grows over time. Such a guarantee is meaningful even if the performance guarantees from Theorem 5.3 are not strong. We provide a very general guarantee: essentially, $A^{IC}$ learns as fast as $A$, up to a constant factor.

Theorem 5.4. Consider the black-box reduction with parameters $k, L$, and let $c = k + Lk$ be the number of rounds in the sampling stage. Then the prediction returned by algorithm $A^{IC}$ in each round $t > c + L$ has the same distribution as that returned by algorithm $A$ in round $\lfloor (t - c)/L \rfloor$.
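The round correspondence in Theorem 5.4 can be written as a one-line helper. The sketch below simply evaluates the formula $\lfloor (t - c)/L \rfloor$ from the theorem statement; the function name is hypothetical, and returning `None` for rounds $t \leq c + L$ is a convention chosen here for the rounds the theorem does not cover.

```python
from typing import Optional


def matching_round_of_original(t: int, c: int, L: int) -> Optional[int]:
    """Round of algorithm A whose prediction has the same distribution as the
    prediction output by the reduction in round t, per Theorem 5.4.

    Returns None for t <= c + L, where the theorem makes no claim.
    """
    if t <= c + L:
        return None
    return (t - c) // L
```

For example, with $c = 10$ and $L = 5$, round $t = 25$ of the reduction corresponds to round $3$ of the original algorithm, so the prediction quality lags by roughly a factor of $L$.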


6 Detail-free algorithm and ex-post regret 

So far we relied on precise knowledge of the prior (and parameterized reward distributions), mainly in order to perform Bayesian posterior updates, and only provided performance guarantees in expectation over the prior. In this section we present an algorithm, called DetailFreeRace, that is “detail-free” (requires only a very limited knowledge of the prior), and enjoys prior-independent performance guarantees.

The algorithm is detail-free in the sense that it only needs to know the prior mean $\mu_m^0$ and a certain prior-dependent threshold $N_\mathcal{P}$, and it only needs to know them approximately. Since the algorithm does not know the exact prior, it avoids posterior updates, and instead estimates the expected reward of each arm with its sample average. The algorithm achieves asymptotically optimal bounds for ex-post regret, both for “nice” MAB instances and in the worst case over all MAB instances.

We use an assumption to restrict the common prior, like in the previous sections. However, we need this new assumption not only to give the algorithm a “fighting chance”, but also to ensure that the sample average of each arm is a good proxy for the respective posterior mean. Hence, we focus on priors that are independent across arms, assume bounded rewards, and full support for each $\mu_i$:

(P4) The prior is independent across arms; all rewards lie in $[0, 1]$; the means $\mu_i$ have full support on $[0, 1]$.

However, we make no assumptions on the distribution of each $\mu_i$ and on the parameterized reward distributions $\mathcal{D}(\mu_i)$, other than the bounded support.

The high-level guarantees for DetailFreeRace can be stated as follows: 

Theorem 6.1. Assume a constant number $m$ of arms. Algorithm DetailFreeRace is parameterized by two positive numbers $(\hat{\mu}, N)$.

(a) The algorithm is BIC if Property (P4) holds and $\mu_m^0 > 0$, and the parameters satisfy $\mu_m^0 \leq \hat{\mu} \leq 2\mu_m^0$ and $N \geq N_\mathcal{P}$, where $N_\mathcal{P} < \infty$ is a threshold that depends only on the prior.

(b) Let $\mu^* = \max_i \mu_i$ and $\Delta = \mu^* - \max_{i: \mu_i < \mu^*} \mu_i$ be the “gap”. The algorithm achieves ex-post regret

$$R(t) \leq \min\left(f(N) \cdot \Delta + \frac{O(\log(TN))}{\Delta},\ t\Delta\right) \leq \min\left(2 f(N),\ \sqrt{O(t \cdot \log(TN))}\right),$$

where $T$ is the time horizon and $f(N) = N + N^2(m - 1)$. This regret bound holds for each round $t \leq T$ and any value of the parameters.


The BIC property requires only a very limited knowledge of the prior: an upper bound on the threshold $N_\mathcal{P}$, and a 2-approximation for the prior mean reward $\mu_m^0$.

Due to the detail-free property we can allow agents to have (somewhat) different priors. For example, all 
agents can start with the same prior, and receive idiosyncratic signals before the algorithm begins. Further, 
agents do not even need to know their own priors exactly. We state this formally as follows: 

Corollary 6.2. The BIC property holds even if each agent has a different prior, as long as the conditions in the theorem are satisfied for the chosen parameters $(\hat{\mu}, N)$ and all agents’ priors. Further, a particular agent does not need to know his own prior exactly in order to verify that the BIC property holds for him; instead, it suffices if he believes that the conditions in the theorem are satisfied for his prior.


Our algorithm consists of two stages. The sampling stage samples each arm a given number of times; it is a detail-free version of Algorithm 3. The racing stage is a BIC version of Active Arms Elimination, a bandit algorithm from Even-Dar et al. (2006). Essentially, arms “race” against one another and drop out over time, until only one arm remains. (Footnote: This approach is known as Hoeffding race in the literature, as the Azuma-Hoeffding inequality is used to guarantee the high-probability selection of the best arm; see Related Work for relevant citations.)

We present a somewhat more flexible parametrization of the algorithm than in Theorem 6.1, which allows for better additive constants. In particular, instead of one parameter $N$ we use several parameters, so that each parameter needs to be larger than a separate prior-dependent threshold in order to guarantee the BIC property. To obtain Theorem 6.1 we simply make all these parameters equal to $N$, and let $N_\mathcal{P}$ be the largest threshold.

The rest of this section is organized as follows: we present the sampling stage in Section 6.1, then the racing stage in Section 6.2, and wrap up the proof of the main theorem in Section 6.3. For better intuition, in Appendix A we provide a standalone exposition for the (substantially simpler) special case of two arms.

6.1 The detail-free sampling stage 

To build up intuition, consider the special case of two arms. Our algorithm for this case is similar in structure to its non-detail-free counterpart, but chooses the “exploit” arm $a^*$ in a different way. Essentially, we use the sample average reward $\hat{\mu}_1$ of arm 1 instead of its posterior mean reward, and compare it against the prior mean reward $\mu_2^0$ of arm 2 (which is the same as the posterior mean of arm 2, because the common prior is independent across arms). To account for the fact that $\hat{\mu}_1$ is only an imprecise estimate of the posterior mean reward, we pick arm 2 as the exploit arm only if $\mu_2^0$ exceeds $\hat{\mu}_1$ by a sufficient margin. The detailed pseudocode and a standalone analysis of this algorithm can be found in Appendix A.
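The two-arm selection rule described above can be sketched in a few lines of Python. The function name, the parameter name `margin`, and the use of a non-strict comparison are assumptions made for illustration; the exact margin and inequality are those of the pseudocode in the appendix, which is not reproduced here.

```python
def exploit_arm_two_arms(sample_avg_arm1: float, prior_mean_arm2: float,
                         margin: float) -> int:
    """Pick the exploit arm in the two-arm detail-free sampling stage.

    Arm 2 is chosen only if its prior mean exceeds the sample average of
    arm 1 by at least `margin`; otherwise arm 1 is chosen.
    """
    if prior_mean_arm2 - sample_avg_arm1 >= margin:
        return 2
    return 1
```

The margin guards against the case where arm 1 merely looks worse because of sampling noise, which is what makes the comparison robust without posterior updates.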

Now we turn to the general case of $m > 2$ arms. On a very high level, the sampling stage proceeds as in Algorithm 3 (the non-detail-free version): after collecting some samples from arm 1, the process again is split into $m - 1$ phases such that arm $i$ is explored in phase $i$. More precisely, rounds with arm $i$ are inserted uniformly at random among many rounds in which the “exploit arm” $a^*$ is chosen.

However, the definition of the exploit arm is rather tricky, essentially because it must be made “with high confidence” and cannot be based on brittle comparisons. Consider a natural first attempt: define the “winner” $w_i$ of the previous phases as an arm $j < i$ with the highest average reward, and pick the exploit arm to be either $w_i$ or $i$, depending on whether the sample average reward of arm $w_i$ is greater than the prior mean reward of arm $i$. Then it is not clear that a given arm, conditional on being chosen as the exploit arm, will have a higher expected reward than any other arm: it appears that no Chernoff-Hoeffding Bound argument can lead to such a conclusion. We resolve this issue by always using arm 1 as the winner of previous phases ($w_i = 1$), so that the exploit arm at phase $i$ is either arm 1 or arm $i$. Further, we alter the condition to determine which of the two arms is the exploit arm: for each phase $i > 2$, this condition now involves comparisons with the previously considered arms. Namely, arm $i$ is the exploit arm if and only if the sample average reward of each arm $j \in (1, i)$ is larger than the sample average reward of arm 1 and smaller than the prior mean reward of arm $i$, and both inequalities hold by some margin.

The pseudocode is in Algorithm 4. The algorithm has three parameters $k, L, C$, whose meaning is as follows: the algorithm collects $k$ samples from each arm, each phase lasts for $kL$ rounds, and $C$ is the safety margin in a rule that defines the exploit arm. Note that the algorithm completes in $k + Lk(m - 1)$ rounds.

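The provable guarantee for Algorithm 4 is stated as follows.

Lemma 6.3. Assume Property (P4). Let $m$ be the number of arms. Fix $C > 0$ and consider event

$$\mathcal{E} = \left\{\forall j \in A - \{1, m\}: \ \mu_1 + \tfrac{3C}{2} \leq \mu_j \leq \mu_m^0 - \tfrac{3C}{2}\right\}. \quad (12)$$

Algorithm 4 is BIC if $C$ is sufficiently small and parameters $k, L$ are large enough:

$$k \geq 8 C^{-2} \log \frac{8m}{C \cdot \Pr(\mathcal{E})} \quad \text{and} \quad L \geq 1 + 2\, \frac{\mu_1^0 - \mu_m^0}{C \cdot \Pr(\mathcal{E})}. \quad (13)$$

ALGORITHM 4: The detail-free sampling stage: samples each arm $k$ times.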
Parameters: $k, L \in \mathbb{N}$ and $C \in (0, 1)$

For the first $k$ agents recommend arm 1, and let $\hat{\mu}_1^k$ be their average reward;
foreach arm $i > 1$ in increasing lexicographic order do
    if $\hat{\mu}_1^k \leq \mu_i^0 - C$ and $\hat{\mu}_1^k + C \leq \hat{\mu}_j^k \leq \mu_i^0 - C$ for all arms $j \in (1, i)$ then
        $a^* = i$
    else
        $a^* = 1$
    end
    From the set $P$ of the next $L \cdot k$ agents, pick a set $Q$ of $k$ agents uniformly at random;
    Every agent $p \in P - Q$ is recommended arm $a^*$;
    Every agent $p \in Q$ is recommended arm $i$;
    Let $\hat{\mu}_i^k$ be the average reward among the agents in $Q$;
end
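The pseudocode of Algorithm 4 can be turned into a short simulation. The Python sketch below is an illustration under stated assumptions, not a reference implementation: arms are 0-indexed (arm 0 plays the role of arm 1 in the text), `prior_means` is assumed to be sorted in decreasing order, and `pull` is a hypothetical callback that returns the realized reward of an arm.

```python
import random
from typing import Callable, List, Sequence, Tuple


def detail_free_sampling(prior_means: Sequence[float], k: int, L: int, C: float,
                         pull: Callable[[int], float],
                         rng: random.Random) -> Tuple[List[int], List[float]]:
    """Sketch of the detail-free sampling stage.

    Returns the sequence of recommended arms and, for each arm, the average
    of the k samples the algorithm records for it.
    """
    m = len(prior_means)
    recs: List[int] = [0] * k                      # first k agents get arm 0
    avg: List[float] = [sum(pull(0) for _ in range(k)) / k]
    for i in range(1, m):
        # Sample averages of the arms strictly between the first arm and arm i.
        between = avg[1:i]
        # Arm i is the exploit arm only if all comparisons hold with margin C.
        is_exploit = (avg[0] <= prior_means[i] - C and
                      all(avg[0] + C <= a <= prior_means[i] - C for a in between))
        exploit = i if is_exploit else 0
        # Q: k of the next L*k agents, chosen uniformly at random, explore arm i.
        explorers = set(rng.sample(range(L * k), k))
        explore_rewards: List[float] = []
        for pos in range(L * k):
            if pos in explorers:
                recs.append(i)
                explore_rewards.append(pull(i))
            else:
                recs.append(exploit)
                pull(exploit)                      # realized, but not recorded
        avg.append(sum(explore_rewards) / k)
    return recs, avg
```

The total number of recommendations is $k + Lk(m - 1)$, matching the round count stated before the lemma; only the agents in $Q$ contribute to the recorded average of arm $i$, as in the pseudocode.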


Observe that $\Pr(\mathcal{E}) > 0$ by Property (P4).

Proof. Consider agent $p$ in phase $i$ of the algorithm. Recall that she is recommended either arm 1 or arm $i$. We will prove incentive-compatibility for the two cases separately.

Denote with $\hat{\mathcal{E}}_i$ the event that the exploit arm $a^*$ in phase $i$ is equal to $i$. By the algorithm’s specification,

$$\hat{\mathcal{E}}_i = \left\{\hat{\mu}_1^k \leq \min_{1 < j < i} \hat{\mu}_j^k - C \ \text{ and } \ \max_{1 < j < i} \hat{\mu}_j^k \leq \mu_i^0 - C\right\}.$$
l<j<i i<i<* 


Part I: Recommendation for arm $i$. We will first argue that an agent $p$ who is recommended arm $i$ does not want to switch to any other arm $j$. The case $j > i$ is trivial because no information about arms $j$ or $i$ has been collected by the algorithm and $\mu_i^0 \geq \mu_j^0$. Thus it suffices to consider $j < i$. We need to show that

$$E[\mu_i^0 - \mu_j \mid I_p = i] \Pr(I_p = i) \geq 0.$$

Agent $p$ is in the “explore group” $Q$ with probability $\tfrac{1}{L}$ and in the “exploit group” $P - Q$ with the remaining probability $1 - \tfrac{1}{L}$. Conditional on being in the explore group, the expected gain from switching to arm $j$ is $\mu_j^0 - \mu_i^0$, since that event does not imply any information about the rewards. If she is in the exploit group, then all that the recommendation for arm $i$ implies is that event $\hat{\mathcal{E}}_i$ has happened. Thus:

$$E[\mu_i^0 - \mu_j \mid I_p = i] \Pr(I_p = i) = E[\mu_i^0 - \mu_j \mid \hat{\mathcal{E}}_i] \Pr(\hat{\mathcal{E}}_i) \cdot \left(1 - \tfrac{1}{L}\right) + \tfrac{1}{L} \left(\mu_i^0 - \mu_j^0\right).$$

Thus for the algorithm to be BIC we need to pick a sufficiently large parameter $L$:

$$L \geq 1 + \frac{\mu_j^0 - \mu_i^0}{E[\mu_i^0 - \mu_j \mid \hat{\mathcal{E}}_i] \Pr(\hat{\mathcal{E}}_i)}.$$


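It remains to lower-bound the denominator in the above equation. Let $\mathcal{C}$ denote the Chernoff event:

$$\mathcal{C} = \{\forall i \in A: \ |\hat{\mu}_i^k - \mu_i| \leq \epsilon\}.$$

Observe that by Chernoff-Hoeffding Bound and the union bound:

$$\forall \mu_1, \ldots, \mu_m: \quad \Pr(\neg\mathcal{C} \mid \mu_1, \ldots, \mu_m) \leq m \cdot e^{-2\epsilon^2 k}.$$

Now observe that since $\mu_i \in [0, 1]$ and by the Chernoff-Hoeffding Bound:

$$\begin{aligned} E[\mu_i^0 - \mu_j \mid \hat{\mathcal{E}}_i] \Pr(\hat{\mathcal{E}}_i) &= E[\mu_i^0 - \mu_j \mid \hat{\mathcal{E}}_i, \mathcal{C}] \Pr(\hat{\mathcal{E}}_i, \mathcal{C}) + E[\mu_i^0 - \mu_j \mid \hat{\mathcal{E}}_i, \neg\mathcal{C}] \Pr(\hat{\mathcal{E}}_i, \neg\mathcal{C}) \\ &\geq E[\mu_i^0 - \mu_j \mid \hat{\mathcal{E}}_i, \mathcal{C}] \Pr(\hat{\mathcal{E}}_i, \mathcal{C}) - m \cdot e^{-2\epsilon^2 k}. \end{aligned}$$

Observe that conditional on the events $\hat{\mathcal{E}}_i$ and $\mathcal{C}$, we know that $\mu_i^0 - \mu_j \geq C - \epsilon$, since event $\hat{\mathcal{E}}_i$ implies that $\mu_i^0 \geq \hat{\mu}_j^k + C$ and event $\mathcal{C}$ implies that $\hat{\mu}_j^k \geq \mu_j - \epsilon$. Thus we get:

$$E[\mu_i^0 - \mu_j \mid \hat{\mathcal{E}}_i] \Pr(\hat{\mathcal{E}}_i) \geq (C - \epsilon) \Pr(\hat{\mathcal{E}}_i, \mathcal{C}) - m \cdot e^{-2\epsilon^2 k}.$$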

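Now let:

$$\tilde{\mathcal{E}}_i = \left\{\mu_1 \leq \min_{1 < j < i} \mu_j - C - 2\epsilon \ \text{ and } \ \max_{1 < j < i} \mu_j \leq \mu_i^0 - C - 2\epsilon\right\},$$

i.e., $\tilde{\mathcal{E}}_i$ is a version of $\hat{\mathcal{E}}_i$ but on the true mean rewards, rather than the sample averages. Observe that under event $\mathcal{C}$, event $\tilde{\mathcal{E}}_i$ implies event $\hat{\mathcal{E}}_i$. Thus:

$$\Pr(\tilde{\mathcal{E}}_i, \mathcal{C}) \leq \Pr(\hat{\mathcal{E}}_i, \mathcal{C}). \quad (14)$$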


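Moreover, by Chernoff-Hoeffding Bound:

$$\Pr(\tilde{\mathcal{E}}_i, \mathcal{C}) = \Pr(\mathcal{C} \mid \tilde{\mathcal{E}}_i) \Pr(\tilde{\mathcal{E}}_i) \geq \left(1 - m \cdot e^{-2\epsilon^2 k}\right) \Pr(\tilde{\mathcal{E}}_i).$$

Hence, we conclude that:

$$E[\mu_i^0 - \mu_j \mid \hat{\mathcal{E}}_i] \Pr(\hat{\mathcal{E}}_i) \geq (C - \epsilon) \left(1 - m \cdot e^{-2\epsilon^2 k}\right) \Pr(\tilde{\mathcal{E}}_i) - m \cdot e^{-2\epsilon^2 k}.$$

By taking $\epsilon = \tfrac{C}{4}$ and $k \geq \tfrac{1}{2} \epsilon^{-2} \log \frac{8m}{C \cdot \Pr(\tilde{\mathcal{E}}_i)}$ we get that:

$$E[\mu_i^0 - \mu_j \mid \hat{\mathcal{E}}_i] \Pr(\hat{\mathcal{E}}_i) \geq \tfrac{3}{4} C \left(1 - \tfrac{1}{8}\right) \Pr(\tilde{\mathcal{E}}_i) - \tfrac{1}{8} C \Pr(\tilde{\mathcal{E}}_i) \geq \tfrac{1}{2} C \Pr(\tilde{\mathcal{E}}_i).$$

Thus it suffices to pick:

$$L \geq 1 + 2\, \frac{\mu_j^0 - \mu_i^0}{C \cdot \Pr(\tilde{\mathcal{E}}_i)}. \quad (15)$$

In turn, since $\mu_1^0 \geq \mu_j^0$ for all arms $j$, it suffices to pick the above for $j = 1$.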

Part II: Recommendation for arm 1. When agent $p$ is recommended arm 1 she knows that she is not in the explore group, and therefore all that is implied by the recommendation is the event $\neg\hat{\mathcal{E}}_i$. Thus we need to show that for any arm $j > 1$:

$$E[\mu_1 - \mu_j \mid \neg\hat{\mathcal{E}}_i] \Pr(\neg\hat{\mathcal{E}}_i) \geq 0.$$

It suffices to consider arms $j \leq i$. Indeed, if an agent does not want to deviate to arm $j = i$ then she also does not want to deviate to any arm $j > i$, because at phase $i$ the algorithm has not collected any information about arms $j \geq i$.

Now observe the following:

$$\begin{aligned} E[\mu_1 - \mu_j \mid \neg\hat{\mathcal{E}}_i] \Pr(\neg\hat{\mathcal{E}}_i) &= E[\mu_1 - \mu_j] - E[\mu_1 - \mu_j \mid \hat{\mathcal{E}}_i] \Pr(\hat{\mathcal{E}}_i) \\ &= \mu_1^0 - \mu_j^0 + E[\mu_j - \mu_1 \mid \hat{\mathcal{E}}_i] \Pr(\hat{\mathcal{E}}_i). \end{aligned}$$

Since, by definition, $\mu_1^0 \geq \mu_j^0$, it suffices to show that for any $j \in [2, i]$:

$$E[\mu_j - \mu_1 \mid \hat{\mathcal{E}}_i] \Pr(\hat{\mathcal{E}}_i) \geq 0.$$

If $j = i$, the above inequality is exactly the incentive-compatibility constraint that we showed in the first part of the proof. Thus we only need to consider arms $j \in [2, i - 1]$.

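Similar to the first part of the proof, we can write:

$$E[\mu_j - \mu_1 \mid \hat{\mathcal{E}}_i] \Pr(\hat{\mathcal{E}}_i) \geq E[\mu_j - \mu_1 \mid \hat{\mathcal{E}}_i, \mathcal{C}] \Pr(\hat{\mathcal{E}}_i, \mathcal{C}) - m \cdot e^{-2\epsilon^2 k}.$$

By similar arguments, since $\hat{\mathcal{E}}_i$ implies that $\hat{\mu}_j^k \geq \hat{\mu}_1^k + C$ and since $\mathcal{C}$ implies that $\mu_j \geq \hat{\mu}_j^k - \epsilon$ and $\hat{\mu}_1^k \geq \mu_1 - \epsilon$, we have that the intersection of the two events implies $\mu_j \geq \mu_1 + C - 2\epsilon$. Hence:

$$E[\mu_j - \mu_1 \mid \hat{\mathcal{E}}_i] \Pr(\hat{\mathcal{E}}_i) \geq (C - 2\epsilon) \Pr(\hat{\mathcal{E}}_i, \mathcal{C}) - m \cdot e^{-2\epsilon^2 k}.$$

By Equations (14) and (15):

$$E[\mu_j - \mu_1 \mid \hat{\mathcal{E}}_i] \Pr(\hat{\mathcal{E}}_i) \geq (C - 2\epsilon) \left(1 - m \cdot e^{-2\epsilon^2 k}\right) \Pr(\tilde{\mathcal{E}}_i) - m \cdot e^{-2\epsilon^2 k}.$$

Setting $\epsilon$ and $k$ the same way as in the first part, we have

$$E[\mu_j - \mu_1 \mid \hat{\mathcal{E}}_i] \Pr(\hat{\mathcal{E}}_i) \geq \tfrac{1}{2} C \left(1 - \tfrac{1}{8}\right) \Pr(\tilde{\mathcal{E}}_i) - \tfrac{1}{8} C \Pr(\tilde{\mathcal{E}}_i) = \tfrac{5}{16} C \Pr(\tilde{\mathcal{E}}_i) \geq 0.$$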


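Wrapping up. We proved that both potential recommendations in phase $i$ are BIC if we take $\epsilon = C/4$ and

$$k \geq 8 C^{-2} \log \frac{8m}{C \cdot \Pr(\tilde{\mathcal{E}}_i)} \quad \text{and} \quad L \geq 1 + 2\, \frac{\mu_1^0 - \mu_m^0}{C \cdot \Pr(\tilde{\mathcal{E}}_i)}.$$

Equation (13) suffices to ensure the above for all $i$ because $\mathcal{E} = \tilde{\mathcal{E}}_m \subset \tilde{\mathcal{E}}_i$ for all phases $i < m$.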
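The two conditions from the wrap-up can be evaluated numerically once the margin $C$ and the probability of the relevant event are known. The Python sketch below rounds each threshold up to an integer; the function name is hypothetical, and `pr_event` must be supplied by the caller since it depends on the prior and is not computed here.

```python
import math
from typing import Tuple


def sampling_stage_thresholds(C: float, pr_event: float, m: int,
                              mu0_first: float, mu0_last: float) -> Tuple[int, int]:
    """Smallest integers (k, L) satisfying
        k >= 8 * C**-2 * log(8m / (C * pr_event))
        L >= 1 + 2 * (mu0_first - mu0_last) / (C * pr_event).
    """
    if not (C > 0 and 0 < pr_event <= 1):
        raise ValueError("need C > 0 and pr_event in (0, 1]")
    k_min = 8.0 * C ** -2 * math.log(8 * m / (C * pr_event))
    L_min = 1.0 + 2.0 * (mu0_first - mu0_last) / (C * pr_event)
    return math.ceil(k_min), math.ceil(L_min)
```

Both thresholds scale inversely with $C \cdot \Pr(\cdot)$, so a small safety margin or an unlikely event makes the sampling stage considerably longer.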

6.2 The detail-free racing stage 

The detail-free racing stage is a BIC version of Active Arms Elimination, a bandit algorithm from Even-Dar et al. (2006). Essentially, arms “race” against one another and drop out over time, until only one arm remains. More formally, the algorithm proceeds as follows. Each arm is labeled “active” initially, but may be permanently “deactivated” once it is found to be sub-optimal with high confidence. Only active arms are chosen. The time is partitioned into phases: in each phase the set of active arms is recomputed, and then all active arms are chosen sequentially. To guarantee incentive-compatibility, the algorithm is initialized with some samples of each arm, collected in the sampling stage. The pseudocode is in Algorithm 5.

ALGORITHM 5: BIC race for $m \geq 2$ arms.

Input: parameters $k \in \mathbb{N}$ and $\theta \geq 1$; time horizon $T$.

Input: $k$ samples of each arm $i$, denoted $r_i^1, \ldots, r_i^k$.

Initialize the set of active arms: $B = \{\text{all arms}\}$;
Split the remainder into consecutive phases, starting from phase $n = k$;
repeat
    // Phase n
    Let $\hat{\mu}_i^n = \frac{1}{n} \sum_{t=1}^{n} r_i^t$ be the sample average for each arm $i \in B$;
    Let $\hat{\mu}_*^n = \max_{i \in B} \hat{\mu}_i^n$ be the largest average reward so far;
    Recompute $B = \{\text{arms } i : \hat{\mu}_*^n - \hat{\mu}_i^n \leq c_n\}$, where $c_n$ is the threshold in phase $n$;
    To the next $|B|$ agents, recommend each arm $i \in B$ sequentially in lexicographic order;
    Let $r_i^n$ be the reward of each arm $i \in B$ in this phase;
    $n = n + 1$;
until $|B| = 1$;
For all remaining agents recommend the single arm $a^*$ that remains in $B$

Lemma 6.4 (BIC). Fix an absolute constant r G (0,1) and let 

Or = - / minPr I Ui — maxu,- > r I . 

^ isA V ) 

Algorithm^is BIC if Property f/@) holds and the parameters satisfy 6 > Or and k > Or log(T). 

Proof. We will show that for any agent $p$ in the racing stage and for any two arms $i, j \in A$:

$$E[\mu_i - \mu_j \mid I_p = i] \Pr(I_p = i) \geq 0.$$

In fact, we will prove a stronger statement:

$$E[\mu_i - \max_{j \neq i} \mu_j \mid I_p = i] \Pr(I_p = i) \geq 0. \quad (16)$$

For each pair of arms $i, j \in A$, denote $Z_{ij} = \mu_i - \mu_j$ and

$$Z_i = \mu_i - \max_{j \neq i} \mu_j = \min_{j \neq i} Z_{ij}.$$

For each phase $n$, let $\hat{z}_{ij}^n = \hat{\mu}_i^n - \hat{\mu}_j^n$. For simplicity of notation, we assume that samples of eliminated arms are also drawn at each phase but simply are not revealed or used by the algorithm.

Let $c_n$ denote the threshold in phase $n$ of the algorithm. Consider the “Chernoff event”:

$$\mathcal{C} = \{\forall n \geq k,\ \forall i, j \in A: \ |Z_{ij} - \hat{z}_{ij}^n| < c_n\}.$$

By Chernoff-Hoeffding Bound and the union bound, for any Z, we have: 

Pr {-.C\Z) < Pr - zl\ > cnu) < E E 

n>n* iJgA n>n* ijGA 

6 
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Given the above and since Zj > — 1 we can write: 


E[Zi\Ip = i]Pr (/p = z) = E[Zi\Ip = i,C]Pr (Ip = i,C) +E[Zj|Ip = z,-iC]Pr (Ip = i,^C) 

9 

TTi 

> E[Zi|Ip = i,C] Pr (Ip = z,C)-(17) 

From here on, we will focus on the first term in Equation (17). Essentially, we will assume that the Chernoff-Hoeffding Bound holds deterministically.

Consider phase n. Denote n* = 0^ log(T), and recall that n > k > n*. 

We will split the integral E[Zj|Ip = i,C] into cases for the value of Z*. Observe that by the definition 
of Cn we have c„ < c„* < Hence, conditional on C, if Zj > r > 2c„*, then we have definitely 

eliminated all other arms except arm i at phase n = k > n*, since 

^ij ^ ^ij Cfi ^ Zj Cn ^ Cn- 

Moreover, if $Z_i \leq -2c_n$, then by similar reasoning, we must have already eliminated arm $i$, since

$$\hat{\mu}_i^n - \hat{\mu}_*^n = \min_{j \neq i} \hat{z}_{ij}^n < \min_{j \neq i} Z_{ij} + c_n = Z_i + c_n \leq -c_n.$$

Thus in that case $I_p = i$ cannot occur. Moreover, if we ignore the case when $Z_i \in [0, \tau)$, then the integral can only decrease, since the integrand is non-negative. Since we want to lower bound it, we will ignore this region. Hence:


E[Zj|Ip = i,C] Pr (Ip = i,C) > r Pr (C, Zj > r) — 2 • Pr (C, — 2cn < Zj < 0) 


> TPr(C,Zj > r) - 2 • Cn 

(18) 

> rPr(C.Z,>r)-" 

(19) 


Moreover, we can lower-bound the term Pr (C, Zj > r) using Chernoff-Hoeffding Bound: 


Pr(C,Zj>r)= Pr(C|Zj >r)Pr(Zj >r) > 1 - ^ • Pr (Zj > r) 

> (l-mV^*) •Pr(Zi >r) > fPr(Zj >t). 

Plugging this back into Equation (fT^ . we obtain: 

E[Zj|Ip = z,C] Pr (Ip = i,C)>\ - Pr (C,Zj > r). 

Plugging this back into Equation (17) and using the fact that $\theta \geq \theta_\tau$, we obtain Equation (16).

Lemma 6.5 (Regret). Algorithm 5 with any parameters $k \in \mathbb{N}$ and $\theta \geq 1$ achieves ex-post regret

$$R(T) \leq \sum_{i \in A:\ \mu_i < \mu^*} \frac{18 \log(T\theta)}{\mu^* - \mu_i}, \quad \text{where } \mu^* = \max_{i \in A} \mu_i. \quad (20)$$

Denoting $\Delta = \min_{i \in A:\ \mu_i < \mu^*} (\mu^* - \mu_i)$, and letting $n^*$ be the duration of the initial sampling stage, the ex-post regret for each round $t$ of the entire algorithm (both stages) is

$$R(t) \leq \min\left(n^* \Delta + \frac{18 \log(T\theta)}{\Delta},\ t\Delta\right) \leq \min\left(2n^*,\ \sqrt{18\, t \log(T\theta)}\right). \quad (21)$$

Proof Sketch. The “logarithmic” regret bound (20) is proved via standard techniques, e.g., see Even-Dar et al. (2006). We omit the details from this version.

To derive the corollary (21), observe that the ex-post regret of the entire algorithm (both stages) is at most that of the second stage plus $\Delta$ per each round of the first stage; alternatively, we can also upper-bound it by $\Delta$ per round. This gives the first inequality in (21). To obtain the second inequality, note that it is trivial for $t \leq n^*$, and for $t > n^*$ consider three cases, depending on the value of $\Delta$, with $\beta$ denoting the logarithmic factor $18 \log(T\theta)$ from (20):

$$\begin{cases} R(t) \leq 2n^* & \text{if } \Delta \geq \sqrt{\beta / n^*}, \\ R(t) \leq 2\beta / \Delta \leq 2\sqrt{\beta t} & \text{if } \sqrt{\beta / t} \leq \Delta < \sqrt{\beta / n^*}, \\ R(t) \leq \Delta t \leq \sqrt{\beta t} & \text{if } \Delta < \sqrt{\beta / t}. \end{cases}$$ ■


6.3 Wrapping up: proof of Theorem 6.1

To complete the proof of Theorem 6.1 it remains to transition to the simpler parametrization, using only two parameters $\hat{\mu} \in [\mu_m^0, 2\mu_m^0]$ and $N \geq N_\mathcal{P}$, where $N_\mathcal{P}$ is a prior-dependent constant.

Let us define the parameters for both stages of the algorithm as follows. Define parameter $C$ for the sampling stage as $C = \hat{\mu}/6$, and set the remaining parameters $k^*, L$ for the sampling stage and $k, \theta$ for the racing stage to be equal to $N$. To define $N_\mathcal{P}$, consider the thresholds for $k^*, L$ in Lemma 6.3 for the appropriate value of $C$, and the thresholds for $k, \theta$ in Lemma 6.4 with any fixed $\tau \in (0, 1)$, and let $N_\mathcal{P}$ be the largest of these four thresholds. This completes the algorithm’s specification.

It is easy to see that both stages are BIC. (For the sampling stage, this is because $\Pr(\mathcal{E})$ is monotonically non-decreasing in parameter $C$.) Further, the sampling stage lasts for $f(N) = N + N^2(m - 1)$ rounds. So the regret bound in the theorem follows from that in Lemma 6.5.
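The simplified parametrization can be summarized in code. The Python sketch below maps the two user-facing parameters $(\hat{\mu}, N)$ to the per-stage parameters described above and computes the sampling-stage length $f(N) = N + N^2(m - 1)$; the function name and the dictionary keys are hypothetical.

```python
from typing import Dict, Union


def detail_free_race_parameters(mu_hat: float, N: int,
                                m: int) -> Dict[str, Union[int, float]]:
    """Derive the per-stage parameters of DetailFreeRace from (mu_hat, N).

    C = mu_hat / 6 is the safety margin of the sampling stage; all remaining
    parameters of both stages are set equal to N.
    """
    return {
        "C": mu_hat / 6,
        "k_star": N,                            # samples per arm, sampling stage
        "L": N,                                 # phase-length multiplier
        "k": N,                                 # racing stage parameter k
        "theta": N,                             # racing stage parameter theta
        "sampling_rounds": N + N * N * (m - 1), # f(N)
    }
```

The quadratic term comes from the $m - 1$ phases of $k^* \cdot L = N^2$ rounds each, on top of the $N$ initial rounds for the first arm.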


7 Extensions: contextual bandits and auxiliary feedback 

We now extend the black-box reduction result in two directions: contexts and auxiliary feedback. Each agent comes with a signal (context), observable by both the agent and the algorithm, which impacts the rewards received by this agent as well as the agent’s beliefs about these rewards. Further, after the agent chooses an arm, the algorithm observes not only the agent’s reward but also some auxiliary feedback. As discussed in Introduction and Related Work, each direction is well-motivated in the context of BIC exploration, and is closely related to a prominent line of work in the literature on multi-armed bandits.

7.1 Setting: BIC contextual bandit exploration with auxiliary feedback 

In each round $t$, a new agent arrives, and the following happens:

• context $x_t \in \mathcal{X}$ is observed by both the algorithm and the agent,

• algorithm recommends action $a_t \in A$ and computes prediction $\phi_t$ (invisible to agents),

• reward $r_t(a_t) \in \mathbb{R}$ and auxiliary feedback $f_t(a_t) \in \mathcal{F}$ are observed by the algorithm.

The sets $\mathcal{X}, A, \mathcal{F}$ are fixed throughout, and called, resp., the context space, the action space, and the feedback space. Here $r_t(\cdot)$ and $f_t(\cdot)$ are functions from the action space to, resp., $\mathbb{R}$ and $\mathcal{F}$.

We assume an IID environment. Formally, the tuple $(x_t \in \mathcal{X};\ r_t: A \to \mathbb{R};\ f_t: A \to \mathcal{F})$ is chosen independently from some fixed distribution $\Psi$ that is not known to the agents or the algorithm.

There is a common prior on $\Psi$, denoted $\mathcal{P}$, which is known to the algorithm and the agents. Specifically, we will assume the following parametric structure. The context $x_t$ is an independent draw from some fixed distribution $\mathcal{D}_X$ over the context space. Let $\mu_{a,x}$ denote the expected reward corresponding to context $x$ and arm $a$. There is a single-parameter family $\mathcal{D}(\mu)$ of reward distributions, parameterized by the mean $\mu$. The vector of expected rewards $\mu = (\mu_{a,x} : a \in A, x \in \mathcal{X})$ is drawn according to a prior $\mathcal{P}^0$. Conditional on $\mu$ and the context $x = x_t$, the reward $r_t(a)$ of every given arm $a = a_t$ is distributed according to $\mathcal{D}(\mu_{a,x})$. Similarly, there is a single-parameter family of distributions $\mathcal{D}_{\mathrm{fb}}(\cdot)$ over the feedback space, so that conditional on $\mu$ and the context $x = x_t$, the feedback $f_t(a)$ for every given arm $a = a_t$ is distributed according to $\mathcal{D}_{\mathrm{fb}}(\mu_{a,x})$. In what follows, we will sometimes write $\mu(a, x)$ instead of $\mu_{a,x}$.

Recall that the planner’s recommendation now depends on the observed context. The recommendation in round $t$ is denoted with $I_t^x$, where $x = x_t$ is the context.

The incentive constraint is now conditioned on the context:

Definition 7.1 (Contextual BIC). A recommendation algorithm is Bayesian incentive-compatible (BIC) if

$$E[\mu_{a,x} \mid I_t^x = a] \geq \max_{a' \in A} E[\mu_{a',x} \mid I_t^x = a] \qquad \forall t \in [T],\ \forall x \in \mathcal{X},\ \forall a \in A.$$


Performance measures. The algorithm’s performance is primarily measured in terms of contextual regret, defined as follows. A policy $\pi: \mathcal{X} \to A$ is a mapping from contexts to actions. The number of possible policies increases exponentially in the cardinality of the context space. Since in realistic applications the context space tends to be huge, learning over the space of all possible policies is usually intractable. A standard way to resolve this (following Langford and Zhang (2007)) is to explicitly restrict the class of policies. Then the algorithm is given a set $\Pi$ of policies, henceforth the policy set, so that the algorithm’s performance is compared to the best policy in $\Pi$. More specifically, Bayesian contextual regret of algorithm $A$ relative to the policy class $\Pi$ in the first $t$ rounds is defined as

$$R_{A,\Pi}(t) = E\left[ t \cdot \sup_{\pi \in \Pi} E[\mu(\pi(x), x)] - E\left[\sum_{s=1}^{t} \mu(a_s, x_s)\right] \right]. \quad (22)$$

As in Section 5, we will also provide performance guarantees in terms of average rewards and in terms of the quality of predictions.
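For a finite context space and a finite policy set, the benchmark term $\sup_{\pi \in \Pi} E[\mu(\pi(x), x)]$ can be computed by direct enumeration. The Python sketch below does this for one fixed realization of the mean rewards, so it corresponds to the quantity inside the outer expectation in (22) rather than to the Bayesian regret itself. All names are hypothetical: `mu[x][a]` is a table of mean rewards, `context_probs[x]` is the context distribution, and each policy is a list mapping context index to action index.

```python
from typing import List, Sequence, Tuple


def best_policy_value(mu: Sequence[Sequence[float]],
                      context_probs: Sequence[float],
                      policies: Sequence[Sequence[int]]) -> float:
    """Largest expected per-round reward achievable by a policy in the set."""
    return max(
        sum(p * mu[x][pi[x]] for x, p in enumerate(context_probs))
        for pi in policies
    )


def realized_contextual_regret(mu: Sequence[Sequence[float]],
                               context_probs: Sequence[float],
                               policies: Sequence[Sequence[int]],
                               played: List[Tuple[int, int]]) -> float:
    """Benchmark reward over len(played) rounds minus the mean reward of the
    (context, action) pairs actually played, for a fixed table mu."""
    t = len(played)
    earned = sum(mu[x][a] for x, a in played)
    return t * best_policy_value(mu, context_probs, policies) - earned
```

Restricting the comparison to a policy set keeps the enumeration tractable; over all mappings from contexts to actions the number of policies would be exponential in the number of contexts.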

Discussion. In the line of work on contextual bandits with policy sets, a typical algorithm is in fact a family of algorithms, parameterized by an oracle that solves a particular optimization problem for a given policy class $\Pi$. The algorithm is then oblivious to how this oracle is implemented. This is a powerful approach, essentially reducing contextual bandits to machine learning on a given data set, so as to take advantage of the rich body of work on the latter. In particular, while the relevant optimization problem is NP-hard for most policy classes studied in the literature, it is often solved very efficiently in practice.

Thus, it is highly desirable to have a BIC contextual bandit algorithm which can work with different oracles for different policy classes. However, an explicit design of such an algorithm would necessitate BIC implementations for particular oracles, which appears very tedious. A black-box reduction such as ours circumvents this difficulty.

A note on notation. We would like to re-order the actions according to their prior mean rewards, as we did 
before. However, we need to do it carefully, as this ordering is now context-specific. Let $\mu^0_{a,x} = \mathbb{E}[\mu_{a,x}]$
be the prior mean reward of arm $a$ given context $x$. Denote with $\sigma(x,i) \in A$ the $i$-th ranked arm for $x$
according to the prior mean rewards:

$$\mu^0_{\sigma(x,1),x} \ge \mu^0_{\sigma(x,2),x} \ge \cdots \ge \mu^0_{\sigma(x,m),x},$$

where $m$ is the number of arms and ties are broken arbitrarily. The arm-rank of arm $a$ given context $x$ is the
$i$ such that $a = \sigma(x,i)$. We will use arm-ranks as an equivalent representation of the action space: choosing
arm-rank $i$ corresponds to choosing action $\sigma(x,i)$ for a given context $x$.

A rank-sample of arm-rank $i$ is a tuple $(x, a, r, f)$, where $x$ is a context, $a = \sigma(x,i)$ is an arm, and $r, f$
are, resp., the reward and auxiliary feedback received for choosing this arm for this context. Unless the context $x$ is
specified explicitly, it is drawn from distribution $P_X$.

7.2 The black-box reduction and provable guarantees 

Given an arbitrary algorithm $\mathcal{A}$, the black-box reduction produces an algorithm $\mathcal{A}^{IC}$. It proceeds in two
stages: the sampling stage and the simulation stage, which extend the respective stages for BIC bandit
exploration. In particular, the sampling stage does not depend on the original algorithm $\mathcal{A}$.

The main new idea is that the sampling stage considers arm-ranks instead of arms: it proceeds in phases
$i = 1, 2, \ldots$, so that in each phase $i$ it collects samples of arm-rank $i$, and the exploit arm is (implicitly) the
result of a contest between arm-ranks $j < i$. Essentially, this design side-steps the dependence on contexts.

Each stage of the black-box reduction has two parameters, $k, L \in \mathbb{N}$. While one can set these parameters
separately for each stage, for the sake of clarity we consider a slightly suboptimal version of the reduction in
which the parameters are the same for both stages. The pseudo-code is given in Algorithm 6. The algorithm
maintains the “current” dataset $S$ and keeps adding various rank-samples to this dataset. The “exploit arm”
now depends on the current context, maximizing the posterior mean reward given the current dataset $S$.


ALGORITHM 6: Black-box reduction with contexts, auxiliary feedback and predictions.
Parameters: $k, L \in \mathbb{N}$; contextual bandit algorithm $\mathcal{A}$

// Sampling stage: obtains k samples from each arm-rank
Initialize the dataset: $S = \emptyset$;
For each context $x \in \mathcal{X}$: denote $a^*_x(S) = \operatorname{argmax}_{a \in A} \mathbb{E}[\mu_{a,x} \mid S]$;
Recommend arm-rank 1 to the first $k$ agents, add the rank-samples to $S$;
for phase $i = 2$ to $m$ do
    // each phase lasts kL rounds
    From the set $P$ of the next $kL$ agents, pick a set $Q$ of $k$ agents uniformly at random;
    For each agent $p \in P - Q$, recommend the “exploit arm” $a^*_{x_p}(S)$;
    For each agent in $Q$, recommend arm-rank $i$;
    Add to $S$ the rank-samples from all agents in $Q$
end

// Simulation stage (each phase lasts L rounds)
foreach phase $n = 1, 2, \ldots$ do
    Pick one agent $p$ from the set $P$ of the next $L$ agents uniformly at random;
    For agent $p$: send context $x_p$ to algorithm $\mathcal{A}$, get back arm $a_n$ and prediction $\phi_n$;
    recommend arm $a_n$ to agent $p$;
    For every agent $t \in P \setminus \{p\}$: recommend the “exploit arm” $a^*_{x_t}(S)$;
    For every agent $t \in P$ in phase $n \ge 2$: record prediction $\phi_{n-1}$;
    Return the rank-sample from agent $p$ to algorithm $\mathcal{A}$;
    Add to $S$ the rank-samples from all agents in this phase;
end
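The sampling stage of the reduction can be sketched in Python as follows. This is only an illustration under stated assumptions: the callables `rank_to_arm` (prior ranking oracle), `exploit_arm` (posterior-mean oracle) and `observe` (the environment) are hypothetical stand-ins, and contexts are supplied by an iterator with one context per arriving agent.

```python
import itertools
import random


def sampling_stage(m, k, L, rank_to_arm, exploit_arm, observe, contexts, rng):
    """Collect k rank-samples (x, a, r, f) for each of the m arm-ranks.

    rank_to_arm(x, i): arm with prior rank i in context x (hypothetical oracle)
    exploit_arm(x, S): posterior-mean maximizer given dataset S (hypothetical oracle)
    observe(x, a):     returns (reward, feedback) for arm a in context x
    """
    S = []
    for _ in range(k):  # the first k agents get arm-rank 1
        x = next(contexts)
        a = rank_to_arm(x, 1)
        r, f = observe(x, a)
        S.append((x, a, r, f))
    for i in range(2, m + 1):  # one phase of k*L agents per remaining arm-rank
        Q = set(rng.sample(range(k * L), k))  # k explorers chosen uniformly at random
        new_samples = []
        for p in range(k * L):
            x = next(contexts)
            if p in Q:
                a = rank_to_arm(x, i)
                r, f = observe(x, a)
                new_samples.append((x, a, r, f))
            else:
                observe(x, exploit_arm(x, S))  # exploit arm; sample is not stored
        S.extend(new_samples)  # dataset grows only at the end of the phase
    return S
```

In this sketch the dataset is extended only once a phase ends, so the exploit arm within a phase is computed from the samples of earlier phases, as in the pseudo-code.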
The black-box reduction accesses the common prior via the following two oracles. First, given a context $x$, rank all
actions according to their prior mean rewards $\mu^0_{a,x}$. Second, given a context $x$ and a set $S$ of rank-samples,
compute the action that maximizes the posterior mean, i.e., $\operatorname{argmax}_{a \in A} \mathbb{E}[\mu_{a,x} \mid S]$.

As before, we need to restrict the common prior $\mathcal{P}$ to guarantee incentive-compatibility. Our assumption
concerns posterior mean rewards conditional on several rank-samples. For a particular context $x$ and
arm-rank $i$, we are interested in the smallest difference, in terms of posterior mean rewards, between the arm-rank
$i$ and any other arm. We state our assumption as follows:


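(P5) For a given arm-rank $i \in [m]$ and parameter $k \in \mathbb{N}$, let $\mathcal{S}_i^k$ be a random variable representing $k_i \ge k$
rank-samples of arm-rank $i$. Given context $x$, denote

$$X^k_{i,j,x} = \min_{\text{arms } a \ne \sigma(x,i)} \mathbb{E}\left[\mu_{\sigma(x,i),x} - \mu_{a,x} \mid \mathcal{S}_1^k, \ldots, \mathcal{S}_j^k\right].$$

There exist prior-dependent constants $k_P < \infty$ and $\tau_P, \rho_P > 0$ such that

$$\Pr\left(X^k_{i,j,x} > \tau_P\right) \ge \rho_P$$

for any $k \ge k_P$, any arm-rank $i \in [m]$, and $j \in \{i-1, m\}$.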


The analysis is essentially the same as before. (For the sampling stage, the only difference is that the 
whole analysis for any single agent is done conditional on her context x and uses arm-ranks instead of arms.) 

Property (P5) is used with $j = i - 1$ for the sampling stage, and with $j = m$ for the simulation stage.
For each stage, the respective property is discovered naturally as the condition needed to complete the proof. 

Theorem 7.2 (BIC). Assume the common prior $\mathcal{P}$ satisfies Property (P5) with constants $(k_P, \tau_P, \rho_P)$. The
black-box reduction is BIC (applied to any algorithm) if its parameters satisfy $k \ge k_P$ and $L \ge L_P$, where
$L_P$ is a finite prior-dependent constant. In particular, one can take

$$L_P = 1 + \sup_{x \in \mathcal{X}} \max_{a, a' \in A} \frac{\mu^0_{a,x} - \mu^0_{a',x}}{\tau_P \cdot \rho_P}.$$

The performance guarantees are similar to those in Theorem 5.3.

Theorem 7.3 (Performance). Consider the black-box reduction with parameters $k, L$, applied to some
algorithm $\mathcal{A}$. The sampling stage then completes in $c = mLk + k$ rounds, where $m$ is the number of arms.

(a) Let $U_{\mathcal{A}}(\tau_1, \tau_2)$ be the average Bayesian-expected reward of algorithm $\mathcal{A}$ in rounds $[\tau_1, \tau_2]$. Then

$$U_{\mathcal{A}^{IC}}(c+1, c+\tau) \ge U_{\mathcal{A}}(1, \lfloor \tau/L \rfloor) \quad \text{for any duration } \tau.$$

(b) Let $R_{\mathcal{A},\Pi}(t)$ be the Bayesian contextual regret of algorithm $\mathcal{A}$ in the first $t$ rounds, relative to some
policy class $\Pi$. Then:

$$R_{\mathcal{A}^{IC},\Pi}(t) \le L \cdot R_{\mathcal{A},\Pi}(\lfloor t/L \rfloor) + c \cdot \mathbb{E}\left[\max_{a \in A} \mu_{a,x} - \min_{a \in A} \mu_{a,x}\right].$$

In particular, if $\mu_{a,x} \in [0,1]$ and the original algorithm $\mathcal{A}$ achieves asymptotically optimal Bayesian
contextual regret $O(\sqrt{t \log |\Pi|})$, then so does $\mathcal{A}^{IC}$ (assuming $L$ is a constant).

(c) The prediction returned by algorithm $\mathcal{A}^{IC}$ in each round $t > c + L$ has the same distribution as that
returned by algorithm $\mathcal{A}$ in round $\lfloor (t - c)/L \rfloor$.
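The round bookkeeping in Theorem 7.3 can be made concrete with a small Python sketch. It assumes the sampling-stage length $c = mLk + k$ as stated in the theorem; the function names are hypothetical and chosen for this example.

```python
def sampling_stage_length(m: int, k: int, L: int) -> int:
    """Number of rounds c after which the sampling stage has completed."""
    return m * L * k + k


def simulated_round(t: int, m: int, k: int, L: int) -> int:
    """Round of the original algorithm whose prediction is matched in round t.

    Follows part (c) of the theorem: defined for t > c + L and equal to
    floor((t - c) / L).
    """
    c = sampling_stage_length(m, k, L)
    if t <= c + L:
        raise ValueError("round t must exceed c + L")
    return (t - c) // L
```

With $m = 3$, $k = 2$ and $L = 4$ the sampling stage takes $c = 26$ rounds, and round $t = 38$ of the reduction corresponds to round $3$ of the original algorithm.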
8 Properties of the common prior


Property (FfH), our main assumption on the common prior for the black-box reduction, is stated in a rather 
abstract manner for the sake of generality (e.g., so as to allow correlated priors), and in order to avoid 
some excessive technicalities. In this section we discuss some necessary or sufficient conditions. First, we 
argue that conditioning on more samples can only help, i.e. that if a similar property holds conditioning 
on less samples, it will still hold conditioning on more samples. This is needed in the incentives analysis 
of the simulation stage, and to show the “almost necessity” for two arms. Second, we exhibit some natural 
sufficient conditions if the prior is independent across arms; specifically, we prove Lemma 5.2.

8.1 Conditioning on more samples can only help 

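Fix arm $i$ and vector $a = (a_1, \ldots, a_m) \in \mathbb{N}^m$, where $m$ is the number of arms. Let $\Lambda_a$ be a random
variable representing the first $a_j$ samples from each arm $j$. Denote

$$X_{i,a} = \min_{\text{arms } j \ne i} \mathbb{E}\left[\mu_i - \mu_j \mid \Lambda_a\right].$$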

We are interested in the following property: 

(P6) There exist prior-dependent constants $\tau_P, \rho_P > 0$ such that $\Pr(X_{i,a} > \tau_P) \ge \rho_P$.

We argue that Property (P6) stays true if we condition on more samples.

Lemma 8.1. Fix arm $i$ and consider vectors $a, b \in \mathbb{N}^m$ such that $a_j \le b_j$ for all arms $j$. If Property (P6)
holds for arm $i$ and vector $a$, then it also holds for arm $i$ and vector $b$.

Proof. For the sake of contradiction, assume that the property for this arm holds for $a$, but not for $b$. Denote
$X = X_{i,a}$ and $Y = X_{i,b}$.

Then, in particular, we have $\Pr(Y > 1/\ell^2) < 1/\ell^2$ for each $\ell \in \mathbb{N}$. It follows that $\Pr(Y \le 0) = 1$.
(To prove this, apply the Borel-Cantelli Lemma with sets $S_\ell = \{Y > 1/\ell^2\}$, noting that $\sum_\ell \Pr(S_\ell)$ is finite
and $\limsup_\ell S_\ell = \{Y > 0\}$.)

The samples in $\Lambda_a$ are a subset of those for $\Lambda_b$, so $\sigma(\Lambda_a) \subset \sigma(\Lambda_b)$, and consequently $X = \mathbb{E}[Y \mid \Lambda_a]$.
Since $Y \le 0$ a.s., it follows that $X \le 0$ a.s. as well, contradicting our assumption. ■

Therefore, Property (flUl implies Property (F|3]) with the same constant kp ; this is needed in the incentives 
analysis of the simulation stage. Also, Property (F{T1 i implies Property (I© with the same kp, so the latter 
property is (also) “almost necessary” for two arms, in the sense that it is necessary for a strongly BIC 
algorithm (see Lemma |4TI) . 

8.2 Sufficient conditions: Proof of Lemma 5.2

For each arm $i$, let $S_i^n$ be a random variable that represents $n$ samples of arm $i$, and let $Y_i^n = \mathbb{E}[\mu_i \mid S_i^n]$ be
the posterior mean reward of this arm conditional on this random variable.

Since the prior is independent across arms. Equation (|9l) in Property (fjUl can be simplified as follows: 

$$\Pr\left(\mu_i^0 - \max\left\{Y_1^k, \ldots, Y_{i-1}^k, \mu_{i+1}^0\right\} > \tau_P\right) \ge \rho_P \qquad \forall k \ge k_P,\ i \in A. \qquad (23)$$

Fix some $\tau_P$ such that $0 < \tau_P < \min_{j \in A}(\mu_j^0 - \mu_{j+1}^0)$. Note that the right-hand side is strictly positive
because the prior mean rewards are all distinct (condition (iii) in the lemma).

Since $(\mu_i : i \in A)$ are mutually independent, and for fixed $k$ each $S_i^k$, $i \in A$, is independent of
everything else conditional on $\mu_i$, we have that $(Y_i^k : i \in A)$ are mutually independent. Thus, one can
rewrite Equation (23) as follows:

$$\prod_{j < i} \Pr\left(Y_j^k < \mu_i^0 - \tau_P\right) \ge \rho_P \qquad \forall k \ge k_P,\ i \in A.$$

Therefore it suffices to prove that for $\tau_P$ as fixed above, and some prior-dependent constant $k_P$, for any two
arms $j < i$ we have:

$$\Pr\left(Y_j^k < \mu_i^0 - \tau_P\right) \ge q > 0 \qquad \forall k \ge k_P, \qquad (24)$$

where $q$ is some prior-dependent constant. Observe that event $\{Y_j^k < \mu_i^0 - \tau_P\}$ is implied by the event
$\{Y_j^k - \mu_j < \epsilon\} \cap \{\mu_j < \mu_i^0 - \tau_P - \epsilon\}$, where $\epsilon > 0$ is a prior-dependent constant to be fixed later. Invoking
also the union bound we get:

$$\Pr\left(Y_j^k < \mu_i^0 - \tau_P\right) \ge \Pr\left((Y_j^k - \mu_j < \epsilon) \cap (\mu_j < \mu_i^0 - \tau_P - \epsilon)\right) \ge \Pr\left(\mu_j < \mu_i^0 - \tau_P - \epsilon\right) - \Pr\left(Y_j^k - \mu_j \ge \epsilon\right). \qquad (25)$$

By the full support assumption (condition (ii) in the lemma), the first summand in (25) is bounded from
below by some constant $\rho > 0$ when $\mu_i^0 - \tau_P - \epsilon > a$. The latter is achieved when $\tau_P, \epsilon$ are sufficiently
small (because $\mu_i^0 > a$, also by the full support assumption). Fix such $\tau_P, \epsilon$ from here on.

It suffices to bound the second summand in (25) from above by $\rho/2$. This follows by a convergence
property (P7) which we articulate below. By this property, there exists a prior-dependent constant $k^*$ such
that $\Pr(Y_j^k - \mu_j \ge \epsilon) \le \rho/2$ for any $k \ge k^*$. Thus, we have proved Equation (24) for $k_P = k^*$ and
$q = \rho/2$. This completes the proof of the lemma.

It remains to state and prove the convergence property used above. 


(P7) For each arm $i$, $Y_i^k$ converges to $\mu_i$ in probability as $k \to \infty$. That is,

$$\forall i \in A \quad \forall \epsilon, \delta > 0 \quad \exists k^*(\epsilon, \delta) < \infty \quad \forall k \ge k^*(\epsilon, \delta): \quad \Pr\left[\,|Y_i^k - \mu_i| \ge \epsilon\,\right] \le \delta.$$

This property follows from the convergence theorem in Doob (1949), because the realized rewards are
bounded (condition (iv) in the lemma). For a clearer exposition of the aforementioned theorem, see e.g.
Theorem 1 in Ghosal (1996). There the sufficient condition is that the parameterized reward distribution
$\mathcal{D}_i(\mu_i)$ of each arm $i$ is pointwise dominated by an integrable random variable; the latter is satisfied if the
realized rewards are bounded.
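The convergence of the posterior mean to the true mean reward can be illustrated with a small Python sketch. It assumes, purely for illustration, Bernoulli rewards with a Beta prior on the mean reward (a conjugate pair not prescribed by the text); the function names are hypothetical.

```python
import random


def posterior_mean_beta(alpha: float, beta: float, rewards) -> float:
    """Posterior mean of a Bernoulli arm's mean reward under a Beta(alpha, beta) prior."""
    return (alpha + sum(rewards)) / (alpha + beta + len(rewards))


def simulate_gap(mu: float, k: int, alpha: float = 1.0, beta: float = 1.0, seed: int = 0) -> float:
    """Draw k Bernoulli(mu) rewards and return |posterior mean - mu|."""
    rng = random.Random(seed)
    rewards = [1 if rng.random() < mu else 0 for _ in range(k)]
    return abs(posterior_mean_beta(alpha, beta, rewards) - mu)
```

Under these assumptions the posterior mean after rewards `[1, 1, 0]` with a uniform Beta(1, 1) prior is $(1 + 2)/(2 + 3) = 0.6$, and calling `simulate_gap` with increasing `k` shows the gap to the true mean shrinking in typical runs.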


9 Conclusions and open questions 

We have resolved the asymptotic regret rates achievable with incentive compatible exploration for a
constant number of actions, as outlined in the Introduction. Focusing on regret minimization with a constant
number of actions, we provided an algorithm with asymptotically optimal regret for BIC bandit exploration,
and a general “black-box” reduction from arbitrary bandit algorithms to incentive compatible ones that
works in a very general explore-exploit setting and increases the regret by at most a constant multiplicative
factor.

This paper sets the stage for future work in several directions. First, the most immediate technical question
left open by our work is whether one can achieve Bayesian regret with constants that do not depend
on the prior. Second, one would like to generalize the machine learning problem to a large (super-constant)
number of actions, possibly with a known structure on the action set. The latter would require handling
agents’ priors with complex correlations across actions. Third, the mechanism design setup could be generalized
to incorporate constraints on how much information about the previous rounds must be revealed to
the agents. Such constraints, arising for practical reasons in the Internet economy and for legal reasons in
medical decisions, typically work against the information asymmetry, and hence make the planner’s problem
more difficult. Fourth, while our detail-free result does not require the algorithm to have full knowledge
of the priors, ideally the planner would like to start with little or no knowledge of the priors, and elicit the
necessary knowledge directly from the agents. Then the submitted information becomes subject to agents’
incentives, along with the agents’ choice of action.
Appendix A: A simpler detail-free algorithm for two arms


The detail-free algorithm becomes substantially simpler for the special case of two arms. For the sake 
of clarity, we provide a standalone exposition. The sampling stage is similar to Algorithm [H except that 
the exploit arm is chosen using the sample average reward of arm 1 instead of its posterior mean reward, 
and picking arm 2 as the exploit arm only if it appears better by a sufficient margin. The racing stage is a 
simple “race” between the two arms: it alternates the two arms until one of them can be eliminated with
high confidence, regardless of the prior, and uses the remaining arm from then on.

Below we provide the algorithm and analysis for each stage separately. The corresponding lemmas can
be used to derive Theorem 16. II for two arms, essentially same way as in Section lOl 

Compared to the general case of m > 2 arms, an additional provable guarantee is that even computing 
the threshold $N_P$ exactly does not require much information: it only requires knowing the prior mean
rewards of both arms, and evaluating the CDFs for $\mu_1$ and $\mu_1 - \mu_2$ at a single point each.

A.1 The sampling stage for two arms

The initial sampling algorithm is similar in structure to that in Section HJ but chooses the “exploit” arm a* 
in a different way (after several samples of arm 1 are drawn). Previously, $a^*$ was the arm with the highest
posterior mean given the collected samples, whereas now the selection of $a^*$ is based only on the sample
average $\hat{\mu}_1$ of the previously drawn samples of arm 1.

We use $\hat{\mu}_1$ instead of the posterior mean of arm 1, and compare it against the prior mean $\mu_2^0$. (Note
that $\mu_2^0$ is also the posterior mean of arm 2, because the prior is independent across arms.) We still need
$\mathbb{E}[\mu_i - \mu_j \mid a^* = i]$ to be positive for all arms $i, j$, even though $\hat{\mu}_1$ is only an imprecise estimate of the
posterior mean of $\mu_1$. For this reason, we pick $a^* = 2$ only if $\mu_2^0$ exceeds $\hat{\mu}_1$ by a sufficient margin. Once $a^*$
is selected, it is recommended in all of the remaining rounds, except for a few rounds chosen u.a.r. in which
arm 2 is recommended. See Algorithm 7 for the details.


ALGORITHM 7: The sampling stage: samples both arms k times.

Parameters: $k, L$ and $C \in (0,1)$.

Let $\hat{\mu}_1$ be the sample average of the resp. rewards;
if $\hat{\mu}_1 \le \mu_2^0 - C$ then $a^* = 2$ else $a^* = 1$;
From the set $P$ of the next $L \cdot k$ agents, pick a set $Q$ of $k$ agents uniformly at random;
Every agent $p \in P - Q$ is recommended arm $a^*$;
Every agent $p \in Q$ is recommended arm 2
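The recommendation rule of this stage can be sketched in Python. This is an illustrative sketch only: it assumes the samples of arm 1 have already been collected and are passed in as `arm1_rewards`, and the function name and return format (a list of recommended arms) are hypothetical.

```python
import random


def two_arm_sampling_stage(arm1_rewards, prior_mean_2, C, k, L, rng):
    """Recommendations (1 or 2) for the next L*k agents.

    The exploit arm is arm 2 only if the prior mean of arm 2 exceeds the
    sample average of arm 1 by at least the margin C.
    """
    mu1_hat = sum(arm1_rewards) / len(arm1_rewards)
    exploit = 2 if mu1_hat <= prior_mean_2 - C else 1
    Q = set(rng.sample(range(L * k), k))  # k explorers chosen uniformly at random
    return [2 if p in Q else exploit for p in range(L * k)]
```

Each of the $L \cdot k$ agents lands in the exploration set with probability $1/L$, which is the quantity the incentive analysis relies on.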


We prove that the algorithm is BIC as long as $C$ is a suitable fraction of $\mu_2^0$, and parameters $k^*, L$ are larger than some
prior-dependent thresholds. We write out these thresholds explicitly, parameterizing them by the ratio $\lambda = C/\mu_2^0$.

Lemma A.1. Consider BIC bandit exploration with two arms. Algorithm DetailFreeSampleBIC($k$)
with parameters $(C, k^*, L)$ completes in $Lk + \max(k, k^*)$ rounds. The algorithm is BIC if Property (P1)
holds and $\mu_2^0 > 0$, and the parameters satisfy

$$\lambda = C/\mu_2^0 \in \left(0, \tfrac{2}{3}\right), \qquad k^* \ge 2(\lambda \cdot \mu_2^0)^{-2} \log\frac{4}{\beta(\lambda)}, \qquad L \ge 1 + \frac{8}{\beta(\lambda)}\left(\mu_1^0 - \mu_2^0\right), \qquad (26)$$

where $\beta(\lambda) = \lambda \cdot \mu_2^0 \cdot \Pr\left(\mu_1/\mu_2^0 \le 1 - \tfrac{3}{2}\lambda\right)$.

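Proof. Let $Z = \mu_2 - \mu_1$. Consider an agent $p > k$. As in the proof of Lemma 4.2 it suffices to show that
$\mathbb{E}[Z \mid I_p = 2] \Pr(I_p = 2) > 0$.

Denote $\mathcal{E}_1 = \{\hat{\mu}_1 \le \mu_2^0 - C\}$. As in the proof of Lemma 4.2,

$$\mathbb{E}[Z \mid I_p = 2] \Pr(I_p = 2) = \mathbb{E}[Z \mid \mathcal{E}_1] \Pr(\mathcal{E}_1) \cdot \left(1 - \tfrac{1}{L}\right) + \mathbb{E}[Z] \cdot \tfrac{1}{L}.$$

Thus for the algorithm to be incentive compatible we need to pick $L$:

$$L \ge 1 + \frac{\mu_1^0 - \mu_2^0}{\mathbb{E}[Z \mid \mathcal{E}_1] \Pr(\mathcal{E}_1)}. \qquad (27)$$

It remains to lower-bound the quantity $\mathbb{E}[Z \mid \mathcal{E}_1] \Pr(\mathcal{E}_1)$. Since the means $\mu_i$ are independent,

$$\mathbb{E}[Z \mid \mathcal{E}_1] \Pr(\mathcal{E}_1) = \mathbb{E}\left[\mu_2^0 - \mu_1 \mid \mu_2^0 - \hat{\mu}_1 \ge C\right] \Pr\left(\mu_2^0 - \hat{\mu}_1 \ge C\right).$$

Denote $X = \mu_2^0 - \mu_1$ and $\bar{x}_k = \mu_2^0 - \hat{\mu}_1$. Observe that $\bar{x}_k$ is the sample average of $k$ i.i.d. samples from
some distribution with mean $X$. Then the quantity that we want to lower bound is:

$$\mathbb{E}[Z \mid \mathcal{E}_1] \Pr(\mathcal{E}_1) = \mathbb{E}[X \mid \bar{x}_k \ge C] \Pr(\bar{x}_k \ge C). \qquad (28)$$

We will use the Chernoff-Hoeffding Bound to relate the right-hand side with quantities that are directly related
to the prior distribution $\mathcal{P}$. More precisely, we use a corollary: Lemma A.5, which we state and prove in
Appendix A.3.

If we apply Lemma A.5 with $C = \lambda \cdot \mu_2^0$ and $\zeta = \kappa = \tfrac{1}{2}$, then for $k \ge 2(\lambda \cdot \mu_2^0)^{-2} \log\frac{4}{\beta(\lambda)}$ the
right-hand side of Equation (28) is at least $\tfrac{1}{8}\beta(\lambda)$. Therefore, $\mathbb{E}[Z \mid \mathcal{E}_1] \Pr(\mathcal{E}_1) \ge \tfrac{1}{8}\beta(\lambda)$. Plugging this
back into Equation (27), it follows that the conditions (26) suffice to guarantee BIC. ■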


A.2 The racing stage for two arms 

The racing stage alternates both arms until one of them can be eliminated with high confidence. Specifically,
we divide time into phases of two rounds each, in each phase select each arm once, and after each phase check
whether $|\hat{\mu}_1^n - \hat{\mu}_2^n|$ is larger than some threshold, where $\hat{\mu}_i^n$ is the sample average of arm $i$ at the beginning
of phase $n$. If that happens then the arm with the higher sample average “wins the race”, and we only pull
this arm from then on. The threshold needs to be sufficiently large to ensure that the “winner” is indeed the
best arm with very high probability, regardless of the prior. The pseudocode is in Algorithm 8.

To guarantee BIC, we use two parameters: the number of samples $k$ collected in the sampling stage, and
parameter $\theta$ inside the decision threshold. The $k$ should be large enough so that when an agent sees a
recommendation for arm 2 in the racing stage there is a significant probability that it is due to the fact that
arm 2 has “won the race” rather than to exploration. The $\theta$ should be large enough so that the arm that has
“won the race” would be much more appealing than the losing arm according to the agents’ posteriors.
ALGORITHM 8: BIC race for two arms.

Input: parameters $k \in \mathbb{N}$ and $\theta \ge 1$; time horizon $T$.
Input: $k$ samples of each arm $i$, denoted $r_i^1, \ldots, r_i^k$.

Let $\hat{\mu}_i^k$ be the sample average for each arm $i$;
Split remainder into consecutive phases of two rounds each, starting from phase $n = k$;
while $|\hat{\mu}_1^n - \hat{\mu}_2^n| \le c_n$ (the decision threshold of phase $n$) do
    The next two agents are recommended both arms sequentially;
    Let $r_i^n$ be the reward of each arm $i = 1, 2$ in this phase, and update the sample average of each arm with it;
    $n = n + 1$;
end
For all remaining agents recommend $a^* = \operatorname{argmax}_{a \in \{1,2\}} \hat{\mu}_a^n$
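The structure of the race can be sketched in Python. Since the exact decision threshold is not reproduced here, the sketch takes it as a caller-supplied function `threshold(n)`; this callable, the `pull` callable for drawing rewards, and the `max_phases` cap are assumptions made for the example rather than parts of the algorithm as stated.

```python
def race(rewards1, rewards2, pull, threshold, max_phases):
    """Alternate both arms until the sample averages differ by more than threshold(n).

    rewards1, rewards2: the k initial samples of arms 1 and 2 (equal length)
    pull(arm):          returns a fresh reward of the given arm
    threshold(n):       decision threshold when each arm has n samples (hypothetical)
    Returns (winner, phases_run); winner is None if the cap max_phases is hit.
    """
    s1, s2, n = sum(rewards1), sum(rewards2), len(rewards1)
    phases = 0
    while abs(s1 / n - s2 / n) <= threshold(n):
        if phases == max_phases:
            return None, phases
        s1 += pull(1)  # one phase = one pull of each arm
        s2 += pull(2)
        n += 1
        phases += 1
    return (1 if s1 > s2 else 2), phases
```

In a full implementation the winner would then be recommended to all remaining agents; the cap only keeps this sketch from looping indefinitely when the arms are indistinguishable.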


Lemma A.2 (BIC). Assume two arms. Fix an absolute constant r £ (0,1) and let 6 t = Pr {^2 ~ /^i > t) 
Algorithm^is BIC if Property f/0) holds, and the parameters satisfy 6 > Or and k> 6“^ log T. 

The regret bound does not depend on the parameters; it is obtained via standard techniques. 

Lemma A. 3 (Regret). Algorithm\^with any parameters k £N and 6 > 1 achieves ex-post regret 

R{T) < (29) 

\hi - F2I 

Denoting A = \p.i — P 2 \ and letting n* be the duration of the initial sampling stage, the ex-post regret for 
each round t of the entire algorithm (both stages) is 

R(t) < min ^n*A + ^ ^ (^2n*, i/Sf log(T0)^ . (30) 

In the remainder of this section we prove Lemma lAdl and Lemma lA^ Denote Zn = P 2 ~ Ai let 

Cn = be the decision threshold for each phase n > fc in Algorithm [ 8 ] For ease of notation we will 

assume that even after the elimination at every iteration a sample of the eliminated arm is also drawn, but 
simply not revealed to the agent. Let Z = p 2 — bi- According to Chernoff-Hoeffding Bound, the following 
event happens with high probability: 


C = {Vn > k : \Z — Zn\ < Cn} ■ (31) 

We make this precise in the following claim: 

Claim A.4. Pr (^C|Z) < 

Proof By Chernoff-Hoeffding Bound and the union bound, for any Z, we have: 

Pr (-C|Z) < ^ Pr {\Z - znl > c„) < ^ < T ■ < ^. ■ 

n>n* n>n* 

Generally we will view ¬C as a “bad event”, show that the expected “loss” from this event is negligible, and
then assume that the “good event” C holds. 
Proof of Lemma \A^ Let n* = 9^ logT. Fix phase n > n*, and some agent p in this phase. We will show
that E[Z\Ip = 2] Pr (Ip = 2) > 0. 

By Claim I a! 41 we have Pr {-<C\Z) < ^ < l/9r- Therefore, since Z > —1, we can write: 

E[Z\Ip = 2] Pr {Ip = 2) = E[Z\Ip = 2,C] Pr {Ip = 2,C)+ E[Z\Ip = 2, ^C] Pr {Ip = 2, -C) 

> E[Z\Ip = 2 ,C]Pt {Ip = 2,C) - e-\ 

Now we will lower-bound the first term. We will split the first integral into cases for the value of Z.
Observe that by the definition of we have Cn* < 9~^. Hence, conditional on C, if Z > r > 2c„», then we 
have definitely stopped and eliminated arm 1 at phase n = k > n*, since Zn > Z — Cn > Cn- Moreover, if 
Z < —2cn, then by similar reasoning, we must have already eliminated arm 2, since Zn < Cn- Thus in that 
case event Ip = 2 cannot occur. Moreover, if we ignore the case when Z S [0, r), then the integral can only 
decrease, since the integrand is non-negative. Since we want to lower bound it, we will ignore this region. 
Hence: 


E[Z\Ip = 2,C]Pi {Ip = 2,C) > rPr(C,Z>r)-2-c„Pr(C,-2c„ < ^ < 0) 

> r Pr (C, Z > r) — 2 • c„ 

Using Chernoff-Hoeffding Bound, we can lower-bound the probability in the first summand of the above: 


Pr (C, Z>t) = 


Pt{C\Z > r)Pr(Z > r) > 



> (l-i) •Pr(Z>r) > |Pr(Z>r). 


• Pr (Z > t) 


Combining all the above inequalities yields that E[Z\Ip = 2] Pr {Ip = 2) > 0. ■ 

Proof of Lemma \AJ\ The “Chernoff event” C has probability at most 1 — ^ by Claim lA^ so the expected 
ex-post regret conditional on is at most 1. For the remainder of the analysis we will assume that C holds. 

Recall that A = |/ii — /r 2 |- Observe that A < \zn\ + Cn- Therefore, for each phase n during the main 
loop we have A < 2 • c^, so n < ^ . Thus, the main loop must end by phase 

In each phase in the main loop, the algorithm collects regret A per round. Event C guarantees that 
the “winner” a* is the best aim when and if the condition in the main loop becomes false, so no regret is 
accumulated after that. Hence, the total regret is R{T) < as claimed in Equation (l29l) . 

The corollary (1^ is derived exactly as the corresponding corollary in Eemma lOl ■ 


A.3 A Chernoff-Hoeffding Bound for the proof of Lemma A.1

In this section we state and prove a version of the Chernoff-Hoeffding Bound used in the proof of Lemma A.1.


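Lemma A.5. Let $X \ge -1$ be a random variable. Let $x_1, \ldots, x_k \in [0,1]$ be i.i.d. random variables with
$\mathbb{E}[x_t \mid X] = X$. Let $\bar{x}_k = \frac{1}{k}\sum_{t=1}^{k} x_t$. Then for any $C, \zeta, \kappa \in (0,1)$, if

$$k \ge \frac{-\log\left(\kappa \cdot (1-\zeta) \cdot C \cdot \Pr(X \ge (1+\zeta)C)\right)}{2\zeta^2 \cdot C^2}$$

then:

$$\mathbb{E}[X \mid \bar{x}_k \ge C] \cdot \Pr(\bar{x}_k \ge C) \ge (1-\zeta) \cdot C \cdot (1 - \kappa - \kappa(1-\zeta)) \cdot \Pr(X \ge (1+\zeta)C).$$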

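We first prove the following lemma.

Lemma A.6. Let $x_1, \ldots, x_k$ be i.i.d. random variables with $\mathbb{E}[x_i \mid X] = X$, and $X \ge -1$. Let $\bar{x}_k = \frac{1}{k}\sum_{t=1}^{k} x_t$.
Then for any $C > 0$, $\epsilon \le C$ and $\delta > 0$:

$$\mathbb{E}[X \mid \bar{x}_k \ge C] \cdot \Pr(\bar{x}_k \ge C) \ge (C - \epsilon) \cdot \left(1 - e^{-2\delta^2 k}\right) \cdot \Pr(X \ge C + \delta) - e^{-2\epsilon^2 k}. \qquad (32)$$

Proof. Let $\mathcal{E}_{C-\epsilon}$ be the event $X \ge C - \epsilon$ and $\hat{\mathcal{E}}_C$ be the event that $\bar{x}_k \ge C$. By the Chernoff-Hoeffding
Bound observe that:

$$\Pr\left(\neg\mathcal{E}_{C-\epsilon}, \hat{\mathcal{E}}_C\right) \le \Pr\left(|X - \bar{x}_k| \ge \epsilon\right) \le e^{-2\epsilon^2 k}. \qquad (33)$$

Moreover, again by the Chernoff-Hoeffding Bound and a simple factorization of the probability:

$$\Pr\left(\mathcal{E}_{C-\epsilon}, \hat{\mathcal{E}}_C\right) \ge \Pr\left(\mathcal{E}_{C+\delta}, \hat{\mathcal{E}}_C\right) = \Pr\left(\hat{\mathcal{E}}_C \mid \mathcal{E}_{C+\delta}\right) \cdot \Pr\left(\mathcal{E}_{C+\delta}\right)$$
$$= \left(1 - \Pr(\bar{x}_k < C \mid X \ge C + \delta)\right) \cdot \Pr(X \ge C + \delta)$$
$$\ge \left(1 - \Pr(\bar{x}_k - X < -\delta \mid X \ge C + \delta)\right) \cdot \Pr(X \ge C + \delta)$$
$$\ge \left(1 - e^{-2\delta^2 k}\right) \cdot \Pr(X \ge C + \delta).$$

We can now break apart the conditional expectation in two cases and use the latter lower and upper bounds
on the probabilities of each case:

$$\mathbb{E}[X \mid \hat{\mathcal{E}}_C] \Pr(\hat{\mathcal{E}}_C) = \mathbb{E}[X \mid \mathcal{E}_{C-\epsilon}, \hat{\mathcal{E}}_C] \cdot \Pr(\mathcal{E}_{C-\epsilon}, \hat{\mathcal{E}}_C) + \mathbb{E}[X \mid \neg\mathcal{E}_{C-\epsilon}, \hat{\mathcal{E}}_C] \cdot \Pr(\neg\mathcal{E}_{C-\epsilon}, \hat{\mathcal{E}}_C)$$
$$\ge (C - \epsilon)\left(1 - e^{-2\delta^2 k}\right) \cdot \Pr(X \ge C + \delta) - 1 \cdot e^{-2\epsilon^2 k},$$

where we also used that $X \ge -1$. This concludes the proof of the lemma. ■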

Proof of Lemma A.5. Now, if $k \ge \frac{-\log\left(\kappa \cdot (1-\zeta) \cdot C \cdot \Pr(X \ge (1+\zeta)C)\right)}{2\zeta^2 \cdot C^2}$, applying the above lemma for
$\delta = \epsilon = \zeta \cdot C$ we get:

$$\mathbb{E}[X \mid \hat{\mathcal{E}}_C] \Pr(\hat{\mathcal{E}}_C) \ge (1-\zeta)C\,(1 - \kappa \cdot (1-\zeta)) \Pr(X \ge (1+\zeta)C) - \kappa(1-\zeta)C \Pr(X \ge (1+\zeta)C)$$
$$\ge (1-\zeta)C\,(1 - \kappa - \kappa(1-\zeta)) \Pr(X \ge (1+\zeta)C),$$

which concludes the proof of the lemma. ■
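The two quantities in Lemma A.5 are easy to evaluate numerically. The following Python sketch computes the sample threshold and the resulting lower bound; the function names are hypothetical, and the probability $q = \Pr(X \ge (1+\zeta)C)$ is assumed to be supplied by the caller, since it depends on the prior.

```python
import math


def lemma_a5_sample_threshold(C: float, zeta: float, kappa: float, q: float) -> int:
    """Smallest integer k with k >= -log(kappa*(1-zeta)*C*q) / (2*zeta^2*C^2)."""
    return math.ceil(-math.log(kappa * (1 - zeta) * C * q) / (2 * zeta**2 * C**2))


def lemma_a5_lower_bound(C: float, zeta: float, kappa: float, q: float) -> float:
    """The guaranteed value (1-zeta)*C*(1-kappa-kappa*(1-zeta))*q."""
    return (1 - zeta) * C * (1 - kappa - kappa * (1 - zeta)) * q
```

For instance, with $C = 0.2$, $\zeta = \kappa = 1/2$ and $q = 0.4$ the threshold is $k = 196$ samples, and the lower bound equals $C q / 8 = 0.01$.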